FAST: Predictive Web Analytics: How does it work?
The FAST (Forecast and Analytics of Social Media and Traffic) platform is based on research by Carlos Castillo (QCRI), Mohammed El-Haddad (Al Jazeera), Jürgen Pfeffer (CMU), and Matt Stempeck (MIT). The software has been developed at QCRI by Carlos Castillo and Kiran Garimella.
Predicting user behavior online is a well-established research topic, and in our paper we acknowledge at least ten previous works that predict other quantities on the web, including numbers of comments, tweets, votes, and links.
We focus specifically on a problem relevant to news websites: predicting the 3-day pageviews of a news article shortly after it is posted on the web. Our system works through a series of steps:
Second, every time a web page from a site makes it into the 30 most visited in a 5-minute window, we launch a separate process that periodically asks Twitter and Facebook for information about that page. Tweets are further processed to measure their information content (entropy) and to count the number of unique messages.
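The tweet-processing step can be sketched as follows. The entropy measure here (Shannon entropy over word frequencies) and the definition of a "unique message" (distinct normalized text) are plausible interpretations for illustration, not necessarily the platform's exact definitions:

```python
import math
from collections import Counter

def tweet_signals(tweets):
    """Return (word entropy, number of unique messages) for a page's tweets."""
    # Unique messages: distinct tweets after trivial normalization
    unique = set(t.strip().lower() for t in tweets)
    # Shannon entropy over the word distribution of all tweets
    words = [w for t in tweets for w in t.lower().split()]
    counts = Counter(words)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy, len(unique)

entropy, n_unique = tweet_signals([
    "Breaking: election results are in",
    "Breaking: election results are in",  # a retweet adds no unique message
    "My take on the election results",
])
```

A stream of near-identical retweets yields low entropy and a low unique-message count, while diverse commentary yields higher values of both.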
Third, all the data is stored in a Cassandra NoSQL database, which keeps information at three time resolutions (1 minute, 5 minutes, and 1 hour).
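A minimal sketch of how events could be bucketed into those three resolutions before being written to the database. The bucket keys below are illustrative; the platform's actual Cassandra schema is not reproduced here:

```python
from datetime import datetime, timezone

# The three resolutions mentioned above, in seconds
RESOLUTIONS = {"1min": 60, "5min": 300, "1hour": 3600}

def bucket_keys(ts: datetime):
    """Map an event timestamp to one bucket key (epoch seconds) per resolution."""
    epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
    # Truncate the timestamp down to the start of each bucket
    return {name: epoch - (epoch % secs) for name, secs in RESOLUTIONS.items()}

keys = bucket_keys(datetime(2014, 5, 1, 12, 34, 56))
```

Counters stored under these keys can then be incremented per pageview, giving cheap aggregate reads at each resolution.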
Finally, we collect all the information from articles older than 3 days and create two linear models — one for news articles and one for other types of articles, such as editorials or features. Pageviews after 3 days are modeled as a function of the signals available after 1 hour, 6 hours, 12 hours, and 24 hours since publication. These models are executed periodically on all new articles that have reached the corresponding age threshold (e.g., all articles at least 1 hour old for the 1-hour model).
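The modelling step can be sketched as an ordinary least-squares fit that predicts 3-day pageviews from early signals. The feature names and training data below are synthetic illustrations; the platform's actual features and its separate news/non-news models are not reproduced here:

```python
import numpy as np

# Columns (hypothetical): pageviews, tweets, Facebook shares in the first hour
X = np.array([
    [1200.0,  40, 15],
    [ 300.0,   5,  2],
    [5000.0, 210, 90],
    [ 800.0,  22, 10],
])
y = np.array([9000.0, 1500, 42000, 5600])  # pageviews after 3 days (synthetic)

# Fit y ≈ X @ w + b by appending an intercept column and solving least squares
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict 3-day pageviews for a new article from its first-hour signals
pred = np.array([1000.0, 30, 12, 1]) @ coef
```

Analogous models would be fit on the 6-hour, 12-hour, and 24-hour signals, each applied only to articles that have reached that age.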
How good are the predictions?
Most new articles exhibit fairly predictable, almost ballistic trajectories, with visits per minute going up and then down along a smooth curve. However, not all articles start with the same speed or generate the same reaction in the audience.
The accuracy of the prediction improves as time passes. Naturally, the more we wait for an article to accumulate pageviews and social media reactions, the better the prediction quality. At the same time, the value of such predictions decreases with time. There is a sweet spot between having early (but less accurate) predictions and having late (but more accurate) predictions.
In our platform, that sweet spot is somewhere between 1 and 6 hours. The majority of news articles have a fairly predictable behavior in which visits slow down rather quickly. For those articles, the predictions after 1 hour already provide valuable hints about whether the article will be a high-traffic one. After 6 hours, we have a clear picture of the ordering of articles, and our predictions are rarely more than 50% off the mark for high-traffic articles.
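To make the "50% off the mark" figure concrete, it can be read as relative error with respect to the actual pageviews (one plausible interpretation):

```python
def relative_error(predicted, actual):
    """Relative prediction error, as a fraction of the actual value."""
    return abs(predicted - actual) / actual

# An article predicted at 30,000 views that actually gets 45,000
# is about 33% off the mark -- within the 50% bound.
err = relative_error(30000, 45000)
```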
What happens if reader habits change?
Our system continuously learns from new articles, and over time the weight of older articles in the model decays to zero.
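One common way to implement this decay is to weight training samples exponentially by article age. The half-life below is an assumption for illustration, not the platform's actual value:

```python
HALF_LIFE_DAYS = 30.0  # hypothetical: weight halves every 30 days

def sample_weight(age_days: float) -> float:
    """Exponentially decaying training weight for an article of a given age."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

# A fresh article counts fully; a 30-day-old article counts half as much;
# a 90-day-old article counts one eighth.
```

With this scheme, articles written under old reader habits eventually contribute (almost) nothing to the fitted model.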
What happens if a slow-burning story catches fire later?
Predictions are revised every few hours to incorporate new events, such as having a link from a high-traffic web page, or having a new cascade of activity in social media.
Does this system replace the editor's work?
Absolutely not. Editors should rely on their own knowledge and instincts. Editors who additionally take into account shifts in their visitors' interests can make better, more informed editorial decisions.
Guest access is available for testing purposes, showing predictions based on a sample of data from Al Jazeera English. Note that some information is visible only to logged-in users.
For inquiries, including how we can predict traffic to your website, please contact Carlos Castillo: