Wikipedia, Two Years Later

Follow up to: Three Millionth Wikipedia Article Followup (Now With Theory!)

Back in 2009, I attempted to predict when the three millionth article of Wikipedia would be released. My prediction turned out to be rather accurate — I had predicted August 14, when the real date was August 17.

Most interestingly, the date was extrapolated three months into the future by just looking at the current growth rate of Wikipedia as if it were linear, assuming a constant growth of 1328 articles a day. This led me to theorize that Wikipedia had actually achieved a linear growth rate.

It turns out I spoke too soon. If I had continued that growth rate to today, exactly two years after I first predicted the three millionth article, Wikipedia would have 3,848,085 articles. As of the time I wrote this blog post, Wikipedia only has 3,638,071 articles, so we’re a full 210,014 articles short. Clearly the linear model doesn’t work for analyzing Wikipedia’s long-term growth.
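To make the shortfall concrete, here is a minimal sketch of the arithmetic. It assumes that "exactly two years" means 730 days, which is an assumption rather than something stated above, and the variable names are purely illustrative.

```python
# Figures quoted in the paragraph above.
ASSUMED_RATE = 1328            # articles per day under the linear assumption
PROJECTED_TOTAL = 3_848_085    # where that rate would have put Wikipedia today
ACTUAL_TOTAL = 3_638_071       # article count at the time of writing

# Assumption: "exactly two years" is taken as 730 days.
DAYS_ELAPSED = 730

shortfall = PROJECTED_TOTAL - ACTUAL_TOTAL

# Back out the starting count implied by the projection, then the
# average daily growth that would explain the actual total.
implied_start = PROJECTED_TOTAL - ASSUMED_RATE * DAYS_ELAPSED
realized_rate = (ACTUAL_TOTAL - implied_start) / DAYS_ELAPSED

print(f"shortfall: {shortfall:,} articles")
print(f"average growth actually realized: {realized_rate:.0f} articles/day")
```

Under that 730-day assumption, the realized average comes out a little over 1,000 articles per day, well below the 1328 used in the original extrapolation.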

So if the linear model is inaccurate, how can we model Wikipedia’s growth? And how can we use this model to predict the much heralded four millionth article? I’m going to find out.

 

Looking at the Data

I’m going to take a different approach than I have in the past. Instead of simply theorizing from a limited amount of data from this week, I’ve found a data set of the average growth rate (among other things) for every month that Wikipedia has been operational. Here is a graph of the total number of articles on Wikipedia at a given time. (Time ranges from January 2001 to March 2011, and the odd stepping toward the end is the result of intermittent data, not an actual trend in the data itself.)

The total number of articles in Wikipedia during a given month.

 

It’s not really clear from this graph how fast Wikipedia is growing, so here’s a graph with the same time scale, except we graph the rate of change (the number of articles added to Wikipedia per day) as a scatterplot:

A scatterplot of the average articles added per day in Wikipedia (the derivative of the total number of articles)
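As a rough illustration of how a per-day rate can be derived from cumulative monthly totals, here is a minimal sketch using finite differences. The function name `daily_growth` and the sample numbers are made up for illustration; they are not the Wikipedia data set, and the data set used here may have computed its averages differently.

```python
from calendar import monthrange

def daily_growth(monthly_totals):
    """Approximate articles added per day for each month.

    monthly_totals is a list of (year, month, total_at_end_of_month)
    tuples in chronological order. The first entry only serves as a
    baseline, so the result has one fewer entry than the input.
    """
    rates = []
    for (_, _, previous), (year, month, current) in zip(monthly_totals, monthly_totals[1:]):
        days_in_month = monthrange(year, month)[1]
        rates.append((year, month, (current - previous) / days_in_month))
    return rates

# Made-up placeholder totals, NOT real Wikipedia figures.
sample = [(2011, 1, 1000), (2011, 2, 1280), (2011, 3, 1590)]
sample_rates = daily_growth(sample)
for year, month, rate in sample_rates:
    print(f"{year}-{month:02d}: {rate:.1f} articles/day")
```

Dividing by the actual number of days in each month keeps short months like February from looking artificially slow.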

 

We can see that while the growth rate varies widely and hasn’t trended strongly in one direction recently, overall the growth rate is decreasing. However, we can also see some other trends.

Wikipedia needed some time to gain appeal at the start, which would explain the initially slow rise, followed by a sharp increase as traffic generated content, which generated more traffic, and so on.

We also see two outliers: a slowdown in the middle of 2002 caused by major server performance problems, which were remedied by extensive work on the software, and a huge jump in October 2002, when an auto-posting robot, Rambot, created roughly 30,000 articles on US towns and cities generated from a database.

 

Not counting Rambot, the true maximum rate of article creation was achieved around August 2006, when Wikipedia hit a record 2400 new articles per day. Since then, this rate has generally fallen, perhaps because as more and more content is generated, the amount of potential content left to write decreases.

However, how long can we expect the growth rate to continue to decrease, and how quickly? (The latter is the second derivative of the number of articles: the change in the change.)

 

Considering a Linear Model

I know I said the linear model was inaccurate, but we should still see how it fits the data, to have a baseline for comparing later models. Running a linear regression on the average articles added per day produces y = 12.898x + 169.47 (y is average articles per day and x is time, in months) with an R^2 value of 0.5406, which is a pretty bad fit:

 

If we run a separate linear regression on the total articles, however, we get y = 33864x - 766633 with an R^2 of 0.9381:
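For readers who want to reproduce this kind of fit, here is a minimal sketch of an ordinary least-squares line together with its R^2, written in plain Python. The function name `linear_fit` and the sample points are hypothetical placeholders; the actual monthly data set is not reproduced here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares the leftover error with the total spread around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Placeholder points, NOT the Wikipedia data set.
months = [0, 1, 2, 3, 4]
totals = [10, 31, 49, 72, 90]
slope, intercept, r_squared = linear_fit(months, totals)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.4f}")
```

An R^2 near 1 means the line explains almost all of the variation; a value like 0.54 means nearly half of the variation is left unexplained.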

 

Linear Conclusions

The linear model suggests that Wikipedia is growing at a rate of 33864 articles per month, or about 1110 articles per day. This is lower (more pessimistic) than the model I suggested earlier, which was 1328 articles per day. This model would predict the 4 millionth article on April 7, 2012. However, the model does not acknowledge that Wikipedia’s growth is in the process of slowing down, which would push the date of the 4 millionth article even further out.
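A minimal sketch of one way to arrive at a date like this: convert the monthly slope to a daily rate and extrapolate from a known article count. The starting count (the figure quoted earlier in the post), the starting date (taken to be the post’s publication date), and the 30.5-day average month are all assumptions for illustration.

```python
from datetime import date, timedelta

ARTICLES_PER_MONTH = 33864      # slope of the total-articles regression
DAYS_PER_MONTH = 30.5           # assumption: average month length for the conversion
CURRENT_TOTAL = 3_638_071       # assumption: count quoted earlier in the post
TARGET = 4_000_000
START = date(2011, 5, 18)       # assumption: the post's publication date

rate_per_day = ARTICLES_PER_MONTH / DAYS_PER_MONTH
days_needed = (TARGET - CURRENT_TOTAL) / rate_per_day
predicted = START + timedelta(days=round(days_needed))

print(f"{rate_per_day:.0f} articles/day, {days_needed:.0f} days to go")
print(f"predicted date: {predicted.isoformat()}")
```

With these assumptions the result lands within a day or so of the date quoted above; a slightly different month length or starting count shifts it by a similar amount.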

 

Considering a Quadratic Model

So if a linear model doesn’t work, what happens if we bump it up one degree, to a quadratic model? What if the growth rate of Wikipedia were simply quadratic, with the number of articles per day rising and then falling back, eventually hitting 0?

Fitting a quadratic model to the articles per day produces y = -0.273x^2 + 46.747x - 535.73 with an R^2 value of 0.7848:

 

However, a separate quadratic model fit to the total number of articles seems to fit rather well, with y = 241.64x^2 + 3900.2x - 142389 producing an R^2 value of 0.9863:

 

Quadratic Conclusions

Using the first quadratic model (articles per day) would predict the “death” of Wikipedia (the point where article growth rate is 0) on March 28, 2014, with 3,658,548 articles.

Using the second quadratic model (total articles), which predicts Wikipedia continuing to grow forever, would put the four millionth article on April 3, 2011. That is demonstrably too optimistic, since it’s already past April 3 and we don’t have four million articles.
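Both dates above come from solving a quadratic equation. The sketch below shows the idea; it assumes x counts months elapsed since January 1, 2001, which is an inference that happens to reproduce the quoted dates to within a day, and the helper names are hypothetical.

```python
import math
from calendar import monthrange
from datetime import date, timedelta

def month_to_date(x, origin_year=2001):
    """Convert fractional months elapsed since January 1 of origin_year to a date.

    Assumption: x = 0 corresponds to January 1, 2001.
    """
    whole = int(x)
    year = origin_year + whole // 12
    month = whole % 12 + 1
    days_in_month = monthrange(year, month)[1]
    return date(year, month, 1) + timedelta(days=int((x - whole) * days_in_month))

def larger_root(a, b, c):
    """Larger real root of a*x^2 + b*x + c = 0 (assumes real roots exist)."""
    disc = math.sqrt(b * b - 4 * a * c)
    return max((-b + disc) / (2 * a), (-b - disc) / (2 * a))

# Articles-per-day model: when does the growth rate fall back to zero?
x_zero_growth = larger_root(-0.273, 46.747, -535.73)

# Total-articles model: when does the total reach four million?
x_four_million = larger_root(241.64, 3900.2, -142389 - 4_000_000)

print(f"zero growth at month {x_zero_growth:.2f}: {month_to_date(x_zero_growth)}")
print(f"four million at month {x_four_million:.2f}: {month_to_date(x_four_million)}")
```

The rate model has two roots; the smaller one sits early in Wikipedia’s history, so the larger one is the relevant "death" date.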

 

Clearly, however, the quadratic model does not put up a good showing, since we can assume the constant flow of history means there will always be something to add to Wikipedia. The growth rate should never hit 0, even if it comes close. I expect the “fall” of Wikipedia to bottom out at dozens of articles per day, at worst.

On the flip side, if the first model was too pessimistic, the second one was wildly optimistic. Despite what the model says, Wikipedia’s growth is indeed slowing. We won’t hit the four millionth article for a while, even if we would normally have expected it by now.

 

Considering an Ad Hoc Polynomial Model

Since the quadratic model fit somewhat well, an interesting next step is to look for a higher-degree polynomial that happens to fit our current data set. While there’s no guarantee it would extrapolate at all, it could potentially have predictive merit.

For the average growth per day, we get a model y = 0.00015x^4 - 0.04397x^3 + 3.9893x^2 - 103.6x + 834.426 with an R^2 value of 0.8985:

 

And for the total number of articles, we get a model of y = 0.0009x^5 - 0.3353x^4 + 40.558x^3 - 1579.9x^2 + 25450x - 89065 with an R^2 value of 0.9994:

 

Polynomial Conclusions

Using this model, we predict the four millionth article to fall on January 26, 2014. However, extrapolating this model for long-term growth reveals an assumption that the growth will shoot back up again, with the articles-per-day fit reaching about 2600 articles per day in July 2013. Fitting a high-degree polynomial to the data isn’t always a good strategy, because it may describe the existing data accurately and still extrapolate poorly.
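The rebound is easy to see by evaluating the articles-per-day polynomial directly. This is a minimal sketch using the rounded coefficients printed above, and it assumes x counts months elapsed since January 2001 (so x = 123 is roughly April 2011 and x = 150 is July 2013). Because the leading coefficient is only given to two significant figures, the values are approximate.

```python
# Coefficients of the articles-per-day polynomial, highest degree first.
RATE_COEFFS = [0.00015, -0.04397, 3.9893, -103.6, 834.426]

def horner(coeffs, x):
    """Evaluate a polynomial using Horner's rule (coefficients highest degree first)."""
    result = 0.0
    for coefficient in coeffs:
        result = result * x + coefficient
    return result

rate_end_of_data = horner(RATE_COEFFS, 123)   # last month in the data set
rate_july_2013 = horner(RATE_COEFFS, 150)     # 27 months past the data

print(f"x = 123: about {rate_end_of_data:.0f} articles/day")
print(f"x = 150: about {rate_july_2013:.0f} articles/day")
```

Just over two years past the end of the data, the fitted curve more than doubles the growth rate, which is the quartic term taking over rather than anything the data supports.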

 

Considering a Split Model

Another interesting thing we notice in the data is that Wikipedia’s growth seems to fall into two very distinct phases: from January 2001 to August 2006, growth appears to rise exponentially, and then from September 2006 to today, growth immediately shifts to a form of exponential decay. We can then model the growth in articles per day as two separate models (strictly speaking, both fits are power laws in x rather than true exponentials): y = 1.1667x^1.6818 models Jan 2001 to Aug 2006 (x from 0 to 68) with an R^2 of 0.919, and y = 132196x^-1.0025 models Sep 2006 to Mar 2011 (x from 69 to 123) with an R^2 of 0.6642:

 

Split Model Conclusions

Applying the same split to the total number of articles produces the equations y = 31.656x^2.4668 with an R^2 of 0.9967 and y = 33471x^0.9692 with an R^2 of 0.9438.

This places the 4 millionth article on August 21, 2012. It also suggests that growth in Wikipedia will continue at a reasonable rate while gradually slowing down. The articles per day will drop below 1000 on January 5, 2012; will drop below 500 per day around October 2022; and drop below 100 per day around August 2119. This decay is still simply a guess, but it seems rather reasonable.
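Crossing points for models of the form y = a * x^b can be found by inverting the power law. The sketch below uses the rounded coefficients printed above and assumes x is in months since January 2001; the function name is hypothetical. The results are quite sensitive to rounding in the exponent, so they will not exactly reproduce the dates quoted.

```python
def power_law_inverse(a, b, y):
    """Solve y = a * x**b for x (requires y / a > 0)."""
    return (y / a) ** (1 / b)

# Second-phase total-articles model: month at which the total reaches 4,000,000.
x_four_million = power_law_inverse(33471, 0.9692, 4_000_000)

# Second-phase articles-per-day model: months at which the rate crosses thresholds.
x_below_1000 = power_law_inverse(132196, -1.0025, 1000)
x_below_500 = power_law_inverse(132196, -1.0025, 500)

print(f"four million articles at month {x_four_million:.1f}")
print(f"rate below 1000/day at month {x_below_1000:.1f}")
print(f"rate below 500/day at month {x_below_500:.1f}")
```

Because the rate exponent is so close to -1, halving the threshold roughly doubles the number of months needed to reach it, which is why the later milestones stretch out over decades.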

 

Rationale for the Split

Lastly, the split does seem to fall at an ostensibly random point in the data. Why shift there, instead of somewhere else? It turns out that there might be an explanation: August 2006 is the same month as Wikimania 2006, where Jimmy Wales, the founder of Wikipedia, stated that Wikipedia had achieved sufficient volume and called for an emphasis on quality, specifically 100,000 feature-quality articles.

This shift may have led those driving Wikipedia’s growth to focus on making more edits per article rather than making more articles. This also has some basis in the data, with the average number of edits per article increasing further after August 2006:

The red line indicates August 2006, when the shift in article growth occurred.

 

Considering a Gompertz Model

One last consideration is a Gompertz model, which has the form y = ae^(be^(cx)), where e is Euler’s number, x is the time variable, and a, b, and c are three constants that change the form of the curve. In general, the Gompertz model is similar to a logistic model, except that the curve approaches its upper limit much more gradually.

Fitting a Gompertz model to the total number of articles produces y = 4378449e^(-15.42677e^(-0.384124x)), where y is the total number of articles and x is time in years:

 

And here is that applied to articles per day:

 

Gompertz Conclusions

The Gompertz model projects the four millionth article on May 19, 2013, and an eventual “death” of Wikipedia on December 7, 2046, with 4,378,449 articles.
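The Gompertz curve can be inverted in closed form, which gives a quick way to check dates like these. The sketch below rests on two inferences that are not stated above but that reproduce the quoted dates: x is measured in years since the start of 2000, and the "death" is the point where the curve comes within one article of its upper limit. The helper names are hypothetical.

```python
import math
from datetime import date, timedelta

# Fitted constants from the Gompertz model above: y = A * exp(B * exp(C * x)).
A, B, C = 4378449, -15.42677, -0.384124

def gompertz(x):
    """Total articles predicted at time x (in years)."""
    return A * math.exp(B * math.exp(C * x))

def gompertz_inverse(y):
    """Time x at which the curve reaches y (requires 0 < y < A)."""
    return math.log(math.log(y / A) / B) / C

def year_to_date(x, origin_year=2000):
    """Assumption: x = 0 is January 1, 2000; years are treated as 365 days."""
    whole = int(x)
    return date(origin_year + whole, 1, 1) + timedelta(days=int((x - whole) * 365))

x_four_million = gompertz_inverse(4_000_000)
x_limit = gompertz_inverse(A - 1)   # within one article of the upper limit

print(f"four million at x = {x_four_million:.2f}: {year_to_date(x_four_million)}")
print(f"within one article of the limit at x = {x_limit:.2f}: {year_to_date(x_limit)}")
```

The upper limit itself is simply the constant A, since the inner exponential decays to zero as x grows.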

 

Predicting the 4 Millionth Article

We have a variety of predictions as to when the 4 millionth article will hit Wikipedia:

  • Quadratic Model: April 3, 2011
  • Linear Model: April 7, 2012
  • Split Model: August 21, 2012
  • Gompertz Model: May 19, 2013
  • Polynomial Model: January 26, 2014

 

But with dates spanning four different years, which one could we claim as accurate? When can we actually expect the four millionth article to be added? When we graph all of the models side by side, we can see they all come to rather different conclusions about future growth:

From top to bottom: gold is the polynomial model; light purple is the quadratic model; dark purple is the linear model; lighter green is the split exponential model; and lighter blue is the Gompertz model. y is the total number of articles, and x is months since the beginning.

 

Factors to Control Growth

As one would expect, making extrapolations is a capricious enterprise. Any sudden change in the Wikipedia status quo could change article growth, as happened at Wikimania 2006, which shifted the focus away from quantity (number of articles) and toward quality (edits per article).

Growth rates of articles are also tied to Wikipedia policies for deletion and notability. Making it (1) easier to delete underperforming articles or (2) more difficult for an article to count as important enough would reduce the growth rate considerably, and vice versa. Furthermore, a big change, such as preventing editing by people who are not logged in to an account, could also greatly decrease the growth rate. This means that while it is easy to fit a model to the current data, it might not be easy to extrapolate from these models.

While it’s rather unrealistic to expect Wikipedia’s growth to shoot back up exponentially, it’s also difficult to imagine the growth rate of Wikipedia ever being 0, or even less than 300. There will always be new historical events to write articles about as current events progress, new television shows to write about, etc. Furthermore, a focus on quality will still result in current articles spilling into additional articles to focus on individual facets of the topic. It does seem that the growth will eventually be linear, or very close to it.

Lastly, it’s not impossible that something could happen to Wikipedia to shut it down prematurely. In the next decade, many things could happen; by 2021, Wikipedia will be twice as old as it is now. Perhaps the Wikimedia Foundation could lose donations and be unable to afford hosting, or something else unfortunate could happen, and Wikipedia could be forced to freeze before reaching the four millionth article.

 

Making the Call

So while that’s all well and good, when can we expect the four millionth article? Right now, I think the split exponential model and the Gompertz model are the most trustworthy and reasonable for extrapolation, and if I had to pick one, I would go with the split exponential model. Therefore, I would say the 4 millionth article will arrive on August 21, 2012, but could occur as late as May 19, 2013.

-

I now blog at EverydayUtilitarian.com. I hope you'll join me at my new blog! This page has been left as an archive.

On 18 May 2011 in All, Mathematics. 2 Comments.

2 Comments

  1. #1 Robert Moore says:
    13 Jul 2012, 7:13 pm  

    Fucking idiot. We reached the 4 millionth article today, July 13th 2012. I didn’t read 95% of your mathematical baloney bullshit but you were simply wrong. Had you not wasted 5 hours writing up this shitty blog you would have saved a lot of time.

  2. #2 Peter Hurford (author) says:
    13 Jul 2012, 7:26 pm  

    Lol, thanks Rob. I do plan to follow this up pretty soon, though.
