Twitter's Anomaly Detection Tool

Twitter announced its R-based "Anomaly Detection" tool in early January. The goal was to detect unusual amounts of traffic on its network, whether caused by bots, spam, or a new software release. It got a lot of attention, so we downloaded it and ran it in R. We then ran the same data through Autobox and said "we win".

Yahoo announced that it is also trying to monitor network traffic, detect fraud and intrusion, and identify anomalies using its own algorithm, which is not publicly available. It posted an anomaly database that only students can access. This made us want to revisit the Twitter example and post our results against Twitter's, and of course to test the Yahoo data (that will be another post once we get access to the data).

The Twitter developers' results may have been an improvement, but we feel they aren't as robust as they could be. What counts as an outlier is always debatable, and false positives need to be considered as well.

There are 14,398 observations in the original dataset. We used only the first 8,560 for this analysis, as there is no reason to analyze that large a dataset: if you are trying to identify outliers and flag that the system is out of control, you want to respond when you find them, not 10,000 observations later. Here are the results from Twitter's R package.
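To make the "respond when you find them" point concrete, here is a minimal sketch of flagging points as they arrive. It is not Twitter's algorithm and not what Autobox does: it swaps in a simple trailing-window median/MAD rule, and the function name `stream_flags`, the `window` and `threshold` values, and the synthetic series are all assumptions made up for illustration.

```python
from collections import deque
from statistics import median


def stream_flags(values, window=60, threshold=5.0):
    """Yield (index, value) for points far from the trailing-window median.

    Each point is scored only against the `window` points that came before
    it, so a flag can be raised as soon as the observation arrives.
    """
    history = deque(maxlen=window)
    for i, x in enumerate(values):
        if len(history) == window:
            med = median(history)
            mad = median(abs(h - med) for h in history)
            # 1.4826 * MAD approximates the standard deviation for normal data.
            scale = 1.4826 * mad
            if scale > 0 and abs(x - med) > threshold * scale:
                yield i, x
        history.append(x)


# Synthetic stand-in for the traffic series (not the Twitter data):
# a small repeating pattern with one injected spike.
values = [10 + (i % 5) for i in range(200)]
values[150] = 100
flags = list(stream_flags(values))
```

On this toy series only the injected spike at index 150 is flagged, and it is flagged at the moment it is read rather than after the whole series has been processed.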

Twitter never explains the seasonality of the data in its press release or elsewhere, but it is evidently minute-level data given the 60-period cycle.
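One rough way to check for a cycle like this when the documentation is silent is to look for the lag with the strongest autocorrelation. The sketch below is only an illustration under stated assumptions: the helper names, the `min_lag` and `max_lag` search bounds, and the synthetic 60-step series are hypothetical, and it is not the procedure either tool uses.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return num / denom


def dominant_period(series, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation.

    `min_lag` should be large enough to skip the short lags, which are
    always highly correlated for a smooth series.
    """
    lags = range(min_lag, max_lag + 1)
    return max(lags, key=lambda lag: autocorrelation(series, lag))


# Synthetic stand-in (not the Twitter data): a pure 60-step cycle.
synthetic = [100 + 10 * math.sin(2 * math.pi * t / 60) for t in range(600)]
period = dominant_period(synthetic, min_lag=30, max_lag=90)
```

Here `period` comes out as 60 for the synthetic series. On real traffic data the peak is noisier, so it is worth looking at the whole autocorrelation profile rather than trusting a single maximum.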

The Twitter tool found that almost 1% of the observations were outliers.



In Twitter's plot, the blue circles mark where outliers were flagged.




Here are the results from Autobox; the P's annotated on the graph mark where outliers were found. Both analyses used an alpha of .05 for identifying outliers, but Autobox found many more.








Truth be told, Autobox's arrays are set to max out at 10,000 observations, though we could expand them if necessary. So in this case we couldn't analyze all of the data, but the picture is clear: Autobox has a more robust approach at the same alpha level of .05 for identifying outliers.
