A simple, spoiler-free analysis of letter frequency in Wordle
The Wordle dictionary differs from a standard English dictionary in some significant ways
Unless you live under a rock, you’ve heard of the word-guessing game Wordle, or at least seen those weird color-coded grids popping up on your social media feeds. In brief, the game gives you six chances to guess the day’s five-letter word, with a structure similar to the old board game Mastermind: you guess a word, and the game tells you whether the letters in your word are in the solution word, as well as whether they’re in the correct place.
Its appeal lies in its simplicity, as well as its creator’s utter refusal to monetize the game’s popularity with spammy ads or shitty apps or any of the other trappings of the modern attention economy. Every day — and once a day only — millions of people go to a bare-bones HTML website and play a game.
There have been 221 Wordles so far, and the game’s creator, a guy named Josh Wardle, has loaded up the website with enough words to generate daily puzzles through the end of the decade. 221 words is a large enough dataset that we can start to run some simple analyses to answer questions like “Which letters appear most often in Wordle?” and “From a letter frequency standpoint, are Wordle words distinct from the broader corpus of all English words?” Let’s take a look.
For starters, here is the frequency of letters that have appeared in all 221 Wordle solutions so far. Not surprisingly, vowels dominate the top of the list, along with common consonants like R and T. If you’re looking to simply brute-force the game with statistics (similar to the way Wheel of Fortune finalists always pick the letters R, S, T, L, N and E), this suggests that starting out with something like ORATE as your first guess is a good way to go.
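For anyone who wants to reproduce that kind of tally, here is a minimal sketch of the counting step. The function name `letter_shares` is my own, and the three sample words are just words mentioned in this post standing in for the real list of 221 solutions, so the percentages it prints mean nothing about Wordle itself.

```python
from collections import Counter

def letter_shares(words):
    """Return each letter's share of all letters in `words`, as a percentage."""
    counts = Counter(ch for word in words for ch in word.upper() if ch.isalpha())
    total = sum(counts.values())
    # most_common() orders letters from most to least frequent
    return {letter: 100 * n / total for letter, n in counts.most_common()}

# Stand-in list, NOT the real solution set.
sample_words = ["ORATE", "PROXY", "GLIDE"]
shares = letter_shares(sample_words)
for letter, pct in shares.items():
    print(f"{letter}: {pct:.1f}%")
```

Swap in the actual solution list and the same function gives the percentages behind a chart like the one described here.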
But the really fun and interesting thing is that this distribution differs from the distribution of letters in the entire English language in some significant ways. Take a look.
Those thinner, darker bars show the distribution for the entire language, ultimately derived from the Concise Oxford Dictionary for a cryptography class at Notre Dame. (It’s worth pointing out that doing this with different dictionaries may yield marginally different percentages depending on their word lists, to the tune of a tenth of a percentage point or two in either direction, but nothing that would change the overall picture here.) If the thinner bar is taller, it means the letter appears more frequently in the dictionary than in Wordle. Conversely, shorter thin bars indicate letters that are disproportionately likely to be in Wordle solutions.
E is slightly under-represented in Wordle relative to the English language, for instance. But check out letters like I and N: they’re quite a bit less likely to appear in Wordle than in the wild. N makes up nearly 7 percent of dictionary letters, but a hair under 4 percent of the letters in Wordle answers.
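The comparison itself is just a per-letter subtraction. Here is a rough sketch, assuming you already have both distributions as letter-to-percentage dictionaries. The helper name `share_gaps` is hypothetical, and the two numbers for N are rounded from the figures quoted above rather than exact measurements.

```python
def share_gaps(wordle_pct, dictionary_pct):
    """Wordle share minus dictionary share, in percentage points, per letter."""
    letters = sorted(set(wordle_pct) | set(dictionary_pct))
    return {
        letter: wordle_pct.get(letter, 0.0) - dictionary_pct.get(letter, 0.0)
        for letter in letters
    }

# Rounded from "a hair under 4 percent" and "nearly 7 percent"; approximate only.
wordle_pct = {"N": 4.0}
dictionary_pct = {"N": 7.0}

gaps = share_gaps(wordle_pct, dictionary_pct)
print(gaps)  # {'N': -3.0}
```

A negative gap means the letter is under-represented in Wordle relative to the dictionary, and a positive one means it is over-represented.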
On the other hand, letters like B, Y and G are considerably more likely to show up in Wordles. (This, incidentally, screwed me hard the other day when the solution was PROXY: I had the first four letters nailed down, but somehow never figured out that the last one was Y.)
There are a couple of things driving those differences. The most obvious is that five-letter words are going to have different structures than shorter or longer ones, leading to different patterns in letter usage. But the other really important thing to note is that the Wordle solutions aren’t just a random sample of all possible five-letter words. If they were, sometimes the answers would be really weird and obscure things like COXAE (an anatomical term for hip joints), GLODE (archaic past tense of GLIDE) or SYVER (a type of street drain). In fact, those weirdo words from the long tail of the language would end up being the answer most of the time, given their sheer number relative to the smaller number of words most people use daily.
Correctly foreseeing that nobody would want to play a guessing game involving obscurities that regular people have never heard of, the creator of Wordle, in his infinite kindness and wisdom, whittled down the list of 12,000+ five-letter words in the English language to a subset of a couple thousand much more common ones. We know this, incidentally, because the internal Wordle dictionary is viewable in the source code of the game’s website.
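To put a rough number on "most of the time," here is a back-of-the-envelope sketch. It assumes exactly 12,000 five-letter words and exactly 2,000 common ones, both round stand-ins for the "12,000+" and "a couple thousand" above, and it assumes a hypothetical version of the game that picks answers uniformly at random from the full list.

```python
# Round-number assumptions, not exact counts from the game's source.
total_five_letter_words = 12_000  # stand-in for "12,000+"
common_words = 2_000              # stand-in for "a couple thousand"

# Chance that a uniformly random pick lands outside the common subset
obscure_share = 1 - common_words / total_five_letter_words
print(f"{obscure_share:.0%}")  # 83%
```

Under those assumptions, roughly five out of every six daily answers would be one of the obscure words.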
You could probably take all of this even further, analyzing actual letter placement to see, for instance, if Es in Wordle are more or less likely to appear at the start or end of a word than they are in the English language. Again, that falls into the realm of brute-forcing the game with statistics. I think it’s fun to do a little bit of that — hence this post — but a big part of the joy of Wordle is simply following your instincts and seeing where your guesses take you.
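If you did want to go down the placement road, a minimal sketch might look like this. The function name `position_counts` is made up for illustration, and the sample words are placeholders rather than the real solution list.

```python
from collections import Counter

def position_counts(words, letter):
    """Count how often `letter` lands in each position (0 = first) across `words`."""
    letter = letter.upper()
    counts = Counter()
    for word in words:
        for i, ch in enumerate(word.upper()):
            if ch == letter:
                counts[i] += 1
    return counts

# Placeholder words, NOT the real solution set.
sample_words = ["ORATE", "GLIDE", "EVERY"]
e_positions = position_counts(sample_words, "E")
for position in range(5):
    print(position, e_positions[position])
```

Running the same count over a general five-letter word list would give you the baseline to compare against.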
"We know this, incidentally, because the internal Wordle dictionary is viewable in the source code of the game’s website."
Christopher, you sly dog, you've hacked Wordle!