Predicting the Three Millionth Article

Monday, May 18, 2009

As many of you know, there’s this awesome database of knowledge called Wikipedia that contains information on everything from President Barack Obama, to the Spanish Inquisition, to a comprehensive list of bow-tie wearers. (Seriously.)

And unless you’re one of the four people in the world living under rocks (with wi-fi) who have never heard of Wikipedia, you should know that the database is quickly approaching three million articles. As of this paragraph, the encyclopedia has a mere 121,355 articles to go.

Of course, if there’s one thing Wikipedia does better than provide information, it’s provide information that can be updated to say anything for periods of twenty minutes at a time to fool your friends. That has nothing to do with this article, though. What I meant to say is that if there’s one thing Wikipedia does better than provide information, it’s provide metadata (information about itself). It has more of it than pretty much anything else publicly accessible.

Look at all those pages. Seriously, it’s like they’re narcissistic or something.

 

So how do we go about predicting the three millionth article? From Wikipedia’s milestone records, we know when the previous marks were reached:


100,000: January 20, 2003 (Hastings, New Zealand)
200,000: February 1, 2004 (Neil Warnock)
500,000: March 17, 2005 (Forced settlements in the Soviet Union)
1,000,000: March 1, 2006 (Jordanhill railway station)
1,500,000: November 24, 2006 (Kanab ambersnail)
2,000,000: September 9, 2007 (El Hormiguero)
2,500,000: August 11, 2008 (Joe Connor)
3,000,000: ?

 

To find out how long we have to wait for the Big 3M, you first have to decide how to model the growth. At first it was believed that the growth was exponential, but apparently people are running out of things to write about.

 

From this chart, the growth of articles seems mostly linear, albeit steadily slowing, which is indicative of a logistic curve.
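The slowdown can also be checked without the chart, using nothing but the milestone list above. Here is a minimal sketch that computes the average number of articles added per day between consecutive milestones. The dates and counts come from the list in this post; the names `MILESTONES` and `daily_rates` are made up for illustration.

```python
from datetime import date

# Milestone article counts and dates, copied from the list in this post.
MILESTONES = [
    (100_000, date(2003, 1, 20)),
    (200_000, date(2004, 2, 1)),
    (500_000, date(2005, 3, 17)),
    (1_000_000, date(2006, 3, 1)),
    (1_500_000, date(2006, 11, 24)),
    (2_000_000, date(2007, 9, 9)),
    (2_500_000, date(2008, 8, 11)),
]


def daily_rates(milestones):
    """Return (milestone_count, average articles per day) for each interval.

    Each rate covers the stretch that ends at the given milestone.
    """
    rates = []
    for (n0, d0), (n1, d1) in zip(milestones, milestones[1:]):
        days = (d1 - d0).days
        rates.append((n1, (n1 - n0) / days))
    return rates


for count, rate in daily_rates(MILESTONES):
    print(f"up to {count:>9,}: {rate:7.0f} articles/day")
```

With these dates, the average rate climbs until the stretch between 1,000,000 and 1,500,000 articles and then drops in each later interval, which is the kind of slowdown a logistic curve would show.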

So when we plot out all of the currently available data points and then slap a badly drawn logistic curve onto them, we get an estimated date of late August to early November for the fabled three millionth article.
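For reference, this is the standard form of a logistic curve and how one would read a date off it. The symbols are assumptions for illustration and do not come from this post: $K$ is the eventual ceiling on the article count, $r$ is the growth rate, and $t_0$ is the midpoint, the time at which the count reaches $K/2$. The curve here was drawn by eye, so no fitted values are given for any of them.

```latex
N(t) = \frac{K}{1 + e^{-r\,(t - t_0)}}
\qquad\Longrightarrow\qquad
t = t_0 - \frac{1}{r}\,\ln\!\left(\frac{K}{N} - 1\right),
\quad 0 < N < K
```

Plugging $N = 3{,}000{,}000$ into the right-hand expression would give the predicted time, but only if the ceiling $K$ is above three million; otherwise the curve never gets there.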


 

Using article counts gathered over only the past two or three days, Wikipedia is currently growing by an average of 1,328 articles per day. Using this highly inconsistent data with an extremely poor sample size, we can pinpoint the date that 3M articles will be surpassed as 92 days from now (121,355 articles to go, divided by 1,328 per day, rounded up). This places the estimate at August 14, 2009. So whether it lands in mid-August or as late as mid-November, the three millionth article should definitely arrive this year.
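The straight-line estimate is simple enough to check in a few lines. This is a minimal sketch: the remaining count and daily rate are the figures quoted in this post, while the function names and the start date are assumptions. Counting 92 days back from August 14 lands on May 14, 2009, so that is the start date assumed here (a few days before the post date).

```python
import math
from datetime import date, timedelta


def days_until(remaining_articles, articles_per_day):
    """Whole days needed to add the remaining articles at a constant rate."""
    return math.ceil(remaining_articles / articles_per_day)


def projected_date(start, remaining_articles, articles_per_day):
    """Date on which the target is reached, assuming linear growth from start."""
    return start + timedelta(days=days_until(remaining_articles, articles_per_day))


REMAINING = 121_355               # articles to go, as quoted in this post
PER_DAY = 1_328                   # average daily growth, as quoted in this post
ASSUMED_START = date(2009, 5, 14)  # assumption: the day the counts were taken

print(days_until(REMAINING, PER_DAY))                      # 92
print(projected_date(ASSUMED_START, REMAINING, PER_DAY))   # 2009-08-14
```

Starting from the post date of May 18 instead would push the same 92 days out to August 18, so the estimate moves one-for-one with whichever start date you pick.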

However, based on the logistic curve, it seems that for article #4 million, we may have to wait another three years or more. It should be noted, though, that unlike a process that follows a true logistic curve, Wikipedia will never run out of content, as current events and new developments outside the wiki will always provide new material. Additionally, Wikipedia could lower its standards for article submission and allow articles on less noteworthy topics.

I can only assume that eventually the growth will flatten out and become, more or less, linear — as it has been for the past month or so.

Followed up by: Three Millionth Wikipedia Article Followup (Now With Theory!)
