What the model learns
An n-gram model counts which words follow each context in the training corpus. The more often a context appears, the more evidence there is behind the probabilities estimated for its next words.
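The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the function name `count_followers` and the toy corpus are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict

def count_followers(tokens, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        followers[context][next_word] += 1
    return followers

# Hypothetical toy corpus, not the training text used by the page.
corpus = "the model predicts the next word and the model learns".split()
bigram_followers = count_followers(corpus, n=2)

context = ("the",)
counts = bigram_followers[context]
total = sum(counts.values())

# Unsmoothed estimate: how often each word followed the context.
for word, count in counts.most_common():
    print(f"{word}: {count}/{total} = {count / total:.2f}")
```

In this toy corpus "the" appears three times as a context, followed twice by "model" and once by "next", so the unsmoothed estimates are 2/3 and 1/3.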
Build intuition for language models by training a small n-gram predictor, typing a context, and inspecting the exact probability calculation behind each suggested next word.
Current best prediction: students (2.26% estimated probability)
Model context: processing helps + helps + unigram
The last words become the evidence used for prediction.
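The context label above appears to list contexts of decreasing length: the last two words, the last word, and no words at all (the unigram case). A rough sketch of how such a list could be derived from typed text follows. The function name `backoff_contexts`, the maximum order of 3, and the sample input are assumptions for illustration; how the page combines the resulting estimates is not stated, so the sketch stops at building the contexts.

```python
def backoff_contexts(text, max_order=3):
    """Return contexts from longest to shortest, ending with the empty (unigram) context."""
    words = text.lower().split()
    longest = max_order - 1  # a trigram model conditions on at most two words
    tail = words[-longest:] if longest > 0 else []
    return [tuple(tail[i:]) for i in range(len(tail) + 1)]

# Hypothetical typed input; only its last two words are used as evidence.
contexts = backoff_contexts("Language processing helps", max_order=3)
for ctx in contexts:
    print(" ".join(ctx) if ctx else "unigram")
```

Each context in the list can then be looked up in its own count table, with shorter contexts covering cases where the longer ones were never seen in training.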
Edit the examples and watch the predictions change.
Top next words sorted by estimated probability.
Word        Count signal   Estimated probability
students    1              2.26%
and         8              1.31%
models      6              1.15%
learning    5              1.07%
to          5              1.07%
next        4              0.99%
the         4              0.99%
word        4              0.99%
Small corpora miss many possible word combinations. Laplace smoothing gives every vocabulary word a small chance instead of treating unseen words as impossible.
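As a concrete illustration, add-one (Laplace) smoothing estimates the probability of a word after a context as (count + 1) / (context total + vocabulary size). The sketch below assumes that standard formula; the function name `laplace_probability`, the four-word vocabulary, and the counts are made up for the example and are not taken from the page.

```python
from collections import Counter

def laplace_probability(word, follower_counts, vocabulary):
    """Add-one estimate: (count + 1) / (context total + vocabulary size)."""
    total = sum(follower_counts.values())
    return (follower_counts[word] + 1) / (total + len(vocabulary))

# Hypothetical vocabulary and follower counts for a single context.
vocabulary = {"students", "models", "learning", "word"}
follower_counts = Counter({"students": 1, "models": 2})

seen = laplace_probability("models", follower_counts, vocabulary)    # (2 + 1) / (3 + 4)
unseen = laplace_probability("word", follower_counts, vocabulary)    # (0 + 1) / (3 + 4)
check = sum(laplace_probability(w, follower_counts, vocabulary) for w in vocabulary)

print(f"seen: {seen:.3f}, unseen: {unseen:.3f}, sum over vocabulary: {check:.3f}")
```

The unseen word still receives 1/7 rather than zero, and the estimates across the whole vocabulary still sum to 1.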
Modern transformers do not only count nearby words. They learn embeddings and attention patterns so much longer context can influence the next token.
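To make the contrast with counting concrete, here is a heavily simplified sketch of scaled dot-product attention in plain Python. It is an illustration under stated assumptions, not how any particular transformer is implemented: the two-dimensional embeddings are invented, and the raw embeddings stand in for the learned query, key, and value projections a real model would use.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight every position by its similarity to the query, then mix the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed

# Made-up embeddings for a four-token context (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.1], [0.9, 0.1]]
query = embeddings[-1]  # the most recent token asks which positions matter
weights, mixed = attend(query, embeddings, embeddings)

print([round(w, 3) for w in weights])
```

With these invented vectors, the first token in the context receives the largest weight even though it is the farthest from the current position, which is the kind of long-range influence a fixed n-gram window cannot capture.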