Khan Redux
You might remember Kevin Martin’s post from earlier this year about how many a’s people put in Khan. He also mentioned that one might fit an equation to the curve.
To a geeky statistician, those are dangerous words. Dangerously appealing words.
Before you continue, let me warn you: extreme geekitude follows; performing some analysis of this was like bringing an elephant gun to a squirrel hunt. A very geeky squirrel hunt (perhaps squirrel fishing). If you’d just like to see a graph of the final model, feel free to skip to the end.
So, the first thing we do is try to come up with a model for this curve. The basic idea is this: every time someone puts up a web page mentioning Kirk’s Khan scream, they have some number of a’s which they’re going to use. We consider that everyone has some number of a’s they tend to feel is appropriate, and that we are selecting from the population of people who put Khan scream references on the web. So we are modeling some underlying distribution of preference for a’s among these people.
Footnote: I also have to recognize that, in addition to a distribution of preference over people, an individual person has some variation in how many a’s they actually put up; that there may be multiple populations of people; and that different kinds of people are more likely to add references to Khan on the web. Some may even post multiple times. While a more complex model that took all of this into account might fit the data better, we simply consider it all as combined into a single conditional distribution: given that a post was made, what is the probability that it includes a certain number of a’s?
The first model is pretty basic: it says that after each ‘a’ is added, there’s a chance that you’ll stop, add an ‘n’, and be done. This probability is the same after each a–it’s not dependent on how many you’ve entered before. This results in the number of a’s being expected to follow a geometric distribution: each ‘a’ entered is a trial, and we continue adding a’s until we ‘succeed’ and add an ‘n’. On a log scale, this model is a straight line.
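(For the curious, here’s roughly what that fit looks like in R. This is just a sketch, not the downloadable script below; the names n_a and pages are my own, standing for the number of a’s and the Google result count for each search.)

    # n_a:   vector of numbers of a's; pages: Google result count for each search
    # Geometric model: P(k a's) = (1 - p)^(k - 1) * p for k >= 1, i.e. k - 1
    # "failures" (another a) before the "success" of finally typing the n.
    geom_loglik <- function(p, n_a, pages) {
      sum(pages * dgeom(n_a - 1, prob = p, log = TRUE))
    }
    # The maximum-likelihood stopping probability has a closed form:
    # one stop per page, so p_hat = (total pages) / (total a's typed).
    p_hat <- sum(pages) / sum(n_a * pages)
    # On a log scale the expected counts fall on a straight line:
    # log E[pages at k a's] = log(sum(pages)) + log(p_hat) + (k - 1) * log(1 - p_hat)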
After seeing this (and a few other models), and doing a little web research, we remove the two leftmost points from the data for our model. These are ‘Khan’ and ‘Khaan’ (1 and 2 a’s). They are much higher than the rest, and substantially change the model. We suspect that their references are largely due to very different sources: anyone referring to Khan Noonien Singh himself (or Genghis Khan, or any other Khan) for the first, and anyone referring to Khaan (an actual animal and also a common alternative transliteration of Khan) for the second.
After we do this, we can see an improved fit, though there are clearly still some regions of higher- or lower-than-expected occurrences.
So we now make our model a bit more complex, reflecting in part the complexity discussed above. We make a mixed model, suggesting that there are two populations posting Khan references. One follows the geometric model we used above; but the other, we will model as a negative binomial distribution: one explanation is that these are people who are aiming for a large number of a’s, and we are modeling their variation in what they think of as “a large number of a’s”. Fitting this mixed model (using maximum likelihood to determine how many people fall into each group, and the distribution parameters for each group) gives us the next graph.
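(As before, here’s a rough sketch of the fit in R rather than the real script: the mixing weight and both components’ parameters are estimated together with optim(), using the same n_a and pages vectors as in the earlier sketch.)

    # Mixture: with weight w, the geometric model above; with weight 1 - w,
    # a (shifted) negative binomial for the "aiming for a lot of a's" crowd.
    mix_negloglik <- function(par, n_a, pages) {
      w    <- plogis(par[1])   # mixing weight for the geometric component
      p    <- plogis(par[2])   # geometric stopping probability
      size <- exp(par[3])      # negative binomial size (dispersion)
      mu   <- exp(par[4])      # negative binomial mean
      dens <- w * dgeom(n_a - 1, prob = p) +
              (1 - w) * dnbinom(n_a - 1, size = size, mu = mu)
      -sum(pages * log(pmax(dens, 1e-300)))   # guard against log(0)
    }
    fit <- optim(c(0, 0, 0, 1), mix_negloglik,
                 n_a = n_a, pages = pages, method = "BFGS")

The plogis()/exp() transforms are just there to keep the parameters in their legal ranges while optim() works on an unconstrained scale.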
A more complex model would let the conditional probability of adding another a vary smoothly with the number of a’s already added…we could, of course, model this as some sort of generalized additive model…sorry, please excuse my drool. Let’s continue.
Of course, I had to take it another couple of steps further. When I started this project, I wrote a perl script which would go to Google each day and save the number of Google results for each search, stored in a file by date. Further, I extended the range to 125 a’s (anything longer than this, Google considers too long). So what we now have is a time series: for each day, we have an entire graph of values. Using this, I was hoping to see how the numbers change over time. Unfortunately, it appears that the results are not consistent over time, jumping up or down significantly. Presumably, this is a result of Google trying out different variants of what results to return. But it means that rather than seeing counts increase over time, we see some variance in each count. For example, the counts for “khaaaaaaaaaaaaaaaaaaaaaaaaaan” (26 a’s) vary from around 150 to around 8000.
You can see the variance overall by looking at a boxplot of the ranges for each number. For some reason, there’s a lot of variance for 5-34 A’s, but not too much outside of that range.
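(The boxplot itself is one line of R, assuming the daily scrapes have been stacked into a long data frame (call it khan_daily) with columns n_a, date, and pages; I’ve put the counts on a log scale since they span several orders of magnitude.)

    # khan_daily: one row per (date, number of a's), columns n_a, date, pages.
    # One box per number of a's, showing the spread of the daily reported counts.
    boxplot(log10(pages) ~ n_a, data = khan_daily,
            xlab = "number of a's", ylab = "log10(Google result count)",
            main = "Daily variation in reported counts")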
So, time series analysis is pretty much out; this is a shame, because you can pretty easily make a video of the counts on each day, over time (with a fitted model for each day). The trouble is that the counts are more affected by the algorithmic decisions Google is making behind the scenes than by any underlying change in the number of pages.
But we can at least try to use this variance to see if it smooths out any of our earlier outliers. Here, we’ll take the median reported values, over time, for each number of a’s (rather than the individual reported numbers on any specific day) and repeat the earlier geometric/negative binomial mixed model:
Final model: Mixed Geometric/Negative Binomial on median counts
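(In R terms, that last step is just an aggregate() by number of a’s, followed by the same mixture fit as before; khan_daily and mix_negloglik are the assumed names from the sketches above, not necessarily what the downloadable code uses.)

    # Median reported count over all days, for each number of a's...
    khan_median <- aggregate(pages ~ n_a, data = khan_daily, FUN = median)
    # ...then refit the geometric/negative binomial mixture to the medians.
    fit_median <- optim(c(0, 0, 0, 1), mix_negloglik,
                        n_a = khan_median$n_a, pages = khan_median$pages,
                        method = "BFGS")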
And that, I think, actually looks like a pretty decent fit. Notice that the negative binomial portion is actually fitting the low-A section now, rather than the strange middle-A hump we saw before; this seems to give a more natural interpretation: most people will put in around 6 A’s for KHAAAAAAN!, and for people stretching longer, a geometric distribution fits pretty well for determining how long they’ll keep adding A’s.
So there you go. Proof that anything can be overanalyzed. If people like this (drop a comment here or email me), I’ll keep collecting data and will look at doing some additional analysis with more data in a few months.
You can download the perl and R code and khan data from thomaslotze.com. While this was inspired directly by Kevin Martin’s post referencing squidnews, there were also earlier graphs from drtofu, Walrus, and Jim Finnis.
Nice job!
Not to nitpick too much — but the tail is way wrong and that is interesting. It’s faddish to note that distributions related to human events tend to have long tails, but the discussion of which distributions are right to use in modeling them has yet to reach a wide audience, or, really, me.
And by ‘long’ tails I really mean ‘fat’ tails, though, for appropriate definitions these are the same.
I’m not sure I’d say /way/ wrong–but I definitely agree with your sense that there is something a little more subtle than strictly geometric going on in the tail (I should account for the censoring above 125, and maybe use a mixture of multiple geometrics, though I’ll point out that Yule-Simon is overly restrictive; but basically find a reasonable way to allow for some more curvature in the tail as more ‘hard-core’ people stay in longer). Of course, with any model, there’s a question of how well you need to account for all the factors (usually depending on what you’re using it for). There will always be subtleties which are /not/ included in the model, and the question is which ones are important. As Box said, “All models are wrong, but some are useful.”
That was awesome. Way to take it to a whole new level. KBM