Statistical Language Models (SLMs) were a game-changer in the 1990s for natural language processing and information retrieval tasks.
The core idea behind SLMs was to predict the next word from the most recent context, using the Markov assumption: the simplification that the probability of a word occurring in a sequence depends only on the preceding n−1 words and not on any words before those. Models built on a fixed window of n words are known as n-gram language models (like bigram and trigram models).
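Formally, the Markov assumption approximates the full conditional probability of a word given its entire history with a fixed-length window:
P(wᵢ | w₁, …, wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁, …, wᵢ₋₁)
For a trigram model (n = 3), this reduces to P(wᵢ | wᵢ₋₂, wᵢ₋₁): only the two preceding words matter.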
Understanding N-Grams
Let’s imagine we have the following vocabulary with only 5 possible words:
Vocabulary = {fox, brown, quick, jumps, the}
The unigrams (1-grams) would be the single words: fox, brown, quick, jumps, the.
The bigrams (2-grams) would be sequences of two words: fox brown, fox quick, fox jumps, etc.
The trigrams (3-grams) would be sequences of three consecutive words: fox brown quick, fox brown jumps, fox brown the, etc.
Higher-order n-grams (4-grams, 5-grams, etc.) continue this pattern but are less commonly used: the number of possible sequences grows exponentially with n, which drives up both computational cost and data sparsity.
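To make this concrete, here is a minimal Python sketch of n-gram extraction from a list of tokens (the ngrams helper is our own illustration, not a library API):

```python
def ngrams(tokens, n):
    """Return all n-grams (tuples of n consecutive tokens) in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox", "jumps"]
print(ngrams(tokens, 1))  # unigrams: ('the',), ('quick',), ...
print(ngrams(tokens, 2))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```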
Predicting the Next Word
Let’s take an example: say we want to predict the next word after The quick brown using a trigram model.
When using trigrams for prediction, the context for the next word after The quick brown is quick brown: a trigram model conditions on only the previous n−1 = 2 words.
During the training phase, the model was exposed to a large corpus of text and calculated and stored the frequencies (counts) of all 3-grams (sequences of 3 consecutive words) in the dataset.
The frequency of each 3-gram is simply the number of times that exact sequence of words appears in the training corpus. Let's define some made-up frequencies for the trigrams starting with quick brown to illustrate how the probabilities would be calculated.
quick brown fox: 42
quick brown brown: 1
quick brown quick: 2
quick brown jumps: 13
quick brown the: 3
This means that in the training set the model saw the sequence quick brown fox 42 times, quick brown brown one time, etc.
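Counting trigrams during training can be sketched in a few lines with collections.Counter. This is a simplified illustration; a real implementation would also handle tokenization, sentence boundaries, and smoothing:

```python
from collections import Counter

def train_trigram_counts(tokens):
    """Count every sequence of 3 consecutive tokens in the corpus."""
    return Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))

# Toy corpus; in practice this would be millions of tokens.
corpus = "the quick brown fox jumps over the quick brown fox".split()
counts = train_trigram_counts(corpus)
print(counts[("quick", "brown", "fox")])  # 2
```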
The model then calculates the probability of each possible word in the vocabulary following the context quick brown. The frequency of quick brown is the sum of the trigram frequencies, so 42+1+2+13+3 = 61.
P(fox | quick brown) = frequency(quick brown fox) / frequency(quick brown) = 42/61 = 0.69
P(brown | quick brown) = frequency(quick brown brown) / frequency(quick brown) = 1/61 = 0.02
P(quick | quick brown) = frequency(quick brown quick) / frequency(quick brown) = 2/61 = 0.03
P(jumps | quick brown) = frequency(quick brown jumps) / frequency(quick brown) = 13/61 = 0.21
P(the | quick brown) = frequency(quick brown the) / frequency(quick brown) = 3/61 = 0.05
The model then predicts the next word by selecting the word with the highest conditional probability given the context quick brown. In our case the word fox has the highest probability of following quick brown.
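Putting it together, here is a sketch of the prediction step using the made-up counts above (predict_next is a hypothetical helper written for this example):

```python
# Made-up trigram counts from the example above.
trigram_counts = {
    ("quick", "brown", "fox"): 42,
    ("quick", "brown", "brown"): 1,
    ("quick", "brown", "quick"): 2,
    ("quick", "brown", "jumps"): 13,
    ("quick", "brown", "the"): 3,
}

def predict_next(context, counts):
    """Return (most probable word, its probability) for a 2-word context."""
    # The bigram frequency is the sum of all trigram counts that start with it.
    candidates = {w: c for (a, b, w), c in counts.items() if (a, b) == context}
    total = sum(candidates.values())
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / total

word, prob = predict_next(("quick", "brown"), trigram_counts)
print(word, round(prob, 2))  # fox 0.69
```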
Now that we have predicted the word fox, we repeat the process using brown fox as the new context, as sketched below.
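Generation is then a greedy loop over this prediction step, always keeping the last two words as the context (a sketch reusing the hypothetical predict_next above; it assumes every context it encounters has at least one recorded trigram):

```python
def generate(context, counts, steps):
    """Greedily extend a 2-word context one predicted word at a time."""
    words = list(context)
    for _ in range(steps):
        next_word, _ = predict_next(tuple(words[-2:]), counts)
        words.append(next_word)
    return " ".join(words)

print(generate(("quick", "brown"), trigram_counts, 1))  # quick brown fox
```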
The Curse of Dimensionality
This is a small example, but the number of possible n-grams grows exponentially with n (the n-gram order) and polynomially with the vocabulary size: for a vocabulary of V words, there are Vⁿ possible n-grams.
For large vocabularies typical in natural language, and even for relatively small values of n, this leads to an enormous number of potential n-grams. For more specific or focused tasks, vocabulary sizes can range from 10k to 50k words. This size is often seen in tasks with a limited domain or scope, such as customer service chatbots. For tasks requiring a broad understanding of language, such as machine translation or general-purpose language models, vocabularies can range from 100k to over a million words.
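To make the numbers concrete: with a 50k-word vocabulary, a trigram model already has 50,000³ = 1.25 × 10¹⁴ possible trigrams, far more than any realistic training corpus could ever cover.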
This is the key challenge of SLMs: the curse of dimensionality. Accurately estimating high-order language models required an exponential number of transition probabilities, and no corpus was large enough to observe most of them even once. The result was severe data sparsity: perfectly valid word sequences received a probability of zero simply because they never appeared in the training data.
It's fascinating to look back and see the foundations that modern language models are built upon. SLMs paved the way for many of the NLP breakthroughs we've seen in recent years!