Aug 2, 2010

Detecting influenza outbreaks by analyzing Twitter messages

Today I read an interesting paper that used Twitter data. It's a paper by Aron Culotta on detecting influenza outbreaks.
http://arxiv.org/abs/1007.4748


Summary =======
This paper examines the correlation between flu-related tweets and actual flu trends, as reported by the US Centers for Disease Control and Prevention (CDC). The author borrows the methodology of Google's Nature paper on flu tracking (Ginsberg et al. 2009) and relates two variables on the log-odds scale: (a) the fraction of the population that has the flu in each week, as reported by the CDC, and (b) the fraction of tweets that contain flu-related keywords.
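To make the log-odds idea concrete for myself, here is a minimal sketch in Python. It assumes a simple linear fit of logit(CDC fraction) on logit(tweet fraction), which is my reading of the setup rather than the paper's exact code. The weekly numbers and the helper names (`logit`, `fit_line`, `pearson`) are made up for illustration.

```python
import math


def logit(x):
    """Log-odds of a proportion x in (0, 1)."""
    return math.log(x / (1.0 - x))


def inv_logit(z):
    """Map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical weekly values, invented for this example (not from the paper).
tweet_frac = [0.0010, 0.0014, 0.0021, 0.0030, 0.0026, 0.0017]  # flu tweets / all tweets
cdc_frac = [0.012, 0.016, 0.023, 0.031, 0.027, 0.019]          # CDC-reported flu fraction

# Fit logit(cdc) = b0 + b1 * logit(tweets).
b0, b1 = fit_line([logit(q) for q in tweet_frac],
                  [logit(p) for p in cdc_frac])

# Turn the fitted log-odds back into predicted fractions and compare.
predicted = [inv_logit(b0 + b1 * logit(q)) for q in tweet_frac]
r = pearson(predicted, cdc_frac)
print("slope = %.3f, correlation = %.3f" % (b1, r))
```

In a real evaluation the correlation would be computed on held-out weeks, not on the same weeks used for the fit as in this toy version.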


Findings =======
(1) Similar to Google's Nature paper, the prediction accuracy is pretty high, around 95%. (The Google paper, based on Google search keywords, reported 97% accuracy.)

(2) What is new in this work is that it investigates the need to prune out spurious words like 'swine', 'h1n1', 'vaccine', and 'http'. Tweets containing such words likely mention the flu only in passing (news, links, vaccination) rather than report flu symptoms. However, removing spurious words didn't always increase the accuracy of the flu trend prediction.

(3) Hence, the author went further and tried two different supervised learning methods to see if spurious tweets could be removed in a smarter way. The answer is partly yes, based on one of the classification methods used.
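The simple pruning in finding (2) can be sketched as a keyword filter. This is only my rough illustration, not the paper's implementation: the spurious words are the ones listed above, while the single flu keyword, the substring matching, and the example tweets are my own assumptions.

```python
# Spurious words as listed in finding (2).
SPURIOUS = ("swine", "h1n1", "vaccine", "http")

# Assumption: a single flu keyword; the paper's actual keyword list may differ.
FLU_KEYWORDS = ("flu",)


def mentions(text, words):
    """True if any of the words occurs as a substring of the lowercased text."""
    lowered = text.lower()
    return any(w in lowered for w in words)


def flu_fraction(tweets, drop_spurious=False):
    """Fraction of tweets that mention a flu keyword.

    With drop_spurious=True, flu tweets that also contain a spurious
    word are not counted as flu tweets.
    """
    if not tweets:
        return 0.0
    hits = 0
    for text in tweets:
        if not mentions(text, FLU_KEYWORDS):
            continue
        if drop_spurious and mentions(text, SPURIOUS):
            continue
        hits += 1
    return hits / len(tweets)


# Made-up tweets for illustration.
tweets = [
    "home sick with the flu, fever all night",
    "Swine flu vaccine now available http://example.com",
    "great game last night",
    "I think I caught the flu from my roommate",
]

raw = flu_fraction(tweets)                         # 3 of 4 tweets mention "flu"
pruned = flu_fraction(tweets, drop_spurious=True)  # the vaccine/link tweet is dropped
print(raw, pruned)
```

Substring matching is crude (it would also match "fluent"), which is one reason a learned classifier, as in finding (3), is the natural next step.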


My thoughts =======
The paper made me think about what sorts of research we are pursuing in the field of data-driven social science. Application-driven research usually has a clear goal, but it always leaves me wanting to know more about the data.

(1) Is higher accuracy really needed? After reading the first part of the paper, I was not convinced why the author went after increasing the prediction accuracy. 95% accuracy seems high enough to be used in a surveillance system.

(2) What's next? Maybe to look at social network topology or geography? The frequency of words is one interesting statistic one can draw from the data. But the data has so much more to tell. How are the users connected -- are users who talk about flu symptoms connected to each other? Or are they located near each other geographically?


PS: I am limiting comments to members of this blog, as I started getting too much spam.