Aug 2, 2010

Detecting influenza outbreaks by analyzing Twitter messages

Today I read an interesting paper that used Twitter data. It's a paper by Aron Culotta on detecting influenza outbreaks.

Summary =======
This paper examines the correlation between flu-related tweets and the actual flu trends (reported by the US centers for disease control and prevention, CDC). The author borrows the methodology used in Google's Nature paper on flu tracking (Ginsberg et al. 2009) and uses the log-odds to measure the correlation between two variables: (a) the fraction of population that have flu over several weeks, reported by CDC, and (b) the fraction of tweets that contain flu related keywords.

Findings =======
(1) Similar to the Google's Nature paper, the accuracy in prediction is pretty high---around 95%. (Google paper, based on google search keywords, reported 97% accuracy.)

(2) What is new in this work is that it investigates the need to prune out spurious words like 'swine', 'h1n1', 'vaccine', and 'http'. Tweets containing spurious words likely contain information about the flu passing, but not flu symptoms. However, removing spurious words didn't always increase the accuracy in the prediction of flu trends.

(3) Hence, the author tried out further with two different supervised learning methods to see if spurious words could be removed in a smart way. The answer is partly yes, based on one of the classification methods used.

My thoughts =======
The paper made me think about what sorts of research we are pursuing on the field of data-driven social science. Application-driven research usually has a clear goal, but it always leaves me the feeling of wanting to know more about the data.

(1) Higher accuracy really needed? After reading the first part of the paper, I was less convinced why the author went after increasing the accuracy in prediction. 95% accuracy seems high enough to be used in a surveillance system.

(2) What's next? Maybe to look at social network topology or geography? The frequency of words is one interesting statistic someone can draw from data. But the data has so much more to tell. How are the users connected -- are users who talk about flu symptoms connected? Or are they located nearby in geography?

PS: I am limiting comments only to the members of this blog, as I started getting too many spams.

Apr 27, 2010

The fragility of interdependency

Today, at AALab, we had an interesting discussion about a Nature paper that was published this month: Catastrophic cascade of failures in interdependent networks by Sergey V. Buldyrev, Roni Parshani, Gerald Paul, H. Eugene Stanley  &  Shlomo Havlin

The paper is about surprising characteristics of a failure on two mutually dependent networks. While most existing studies on network robustness have focused on a single network, this doesn't mean that most real networks are independent of each other. Real networks are often mutually dependent, like the power network and the Internet described in the paper. A router cannot function if a nearby power station is out; a power station cannot function if it is disconnected (i.e., if Internet fails). Logical networks like those of financial and political networks are also mutually dependent. Some networks even share the same entities; we belong to multiple social networks like Facebook and Twitter. And when networks are mutually dependent, a single local failure could trigger a disruptive avalanche of cascading and escalating failures. 

Here are three surprising results from the paper:
(Summarized well in Vespignani's article)

1. Networks exhibit a critical threshold value for the fraction of nodes that can be removed above which the network becomes totally fragmented (i.e., the size of the giant component becomes 0). Compared to a single independent network, mutually dependent networks have a much smaller threshold value. This means that mutually dependent networks could collapse at a smaller level of damage.

2. In independent networks, increasing random failures will gradually harm the integrity of the network. However, in mutually dependent networks, increasing random failures lead to abrupt collapse of the network as shown in the graph below. G means the fraction of nodes that belong to the giant connected component. Failures in mutually dependent networks cause a step-like first-order jump around q_c.  

3. In independent networks, heavy-tailed degree distributions have been proven to add great robustness under random failures. Power-law graphs are known more robust to random failures than Erdos-Renyi random graphs. This, however, is also not true in mutually dependent networks. Power-law graphs are more fragile when they are mutually connected. The broader the degree distribution is, the more fragile the networks are.

Jan 12, 2010

A Nature paper on human insurgency

Common ecology quantifies human insurgency
by Bohorquez, Gourley, Dixon, Spagat, and Johnson
(CEIBA complex systems research center in Columbia, etc)

The paper presents an empirical data analysis of 9 major insurgencies, consisting of 54,679 events worldwide. The paper centers around the theme of finding a common factor across different insurgent conflicts and proposes a model of conflict organization, which treats insurgent population as an ecology of dynamically evolving, decision-making groups. Although groups are heterogeneous in terms of their strategies, they tend to converge towards similar response when fed the same information.

Finding (a) Insurgent conflicts have different characteristics from the traditional wars: their scaling factor is around 2.5 and typically show a power-law trend (rejects log-normality), while traditional wars cannot reject log-normality and show a smaller scaling factor. The implication of this scaling factor (2.5) is at its robustness. (It's hard to fragment such insurgent group.)
Finding (b) Burstiness in terms of the number of events per day shows abundance of heavy and light days, or conversely, lack of 'medium' days. The implication of this burstiness is at maximizing the media attention. (Modern insurgent conflicts are not about killing others, but about media show. It's better to attack on a quiet day)

Model assumes two mechanisms: (a) on-going group dynamics within the insurgent population (coalescence and fragmentation of members to different groups over time) (b) group decision making about when to attack

Personal note: I read this paper twice. At first, the model seemed too simple and I wasn't really impressed by the work. Given the general theme of the paper, I expected there to be lots of sophisticated ideas and theories that people have looked at. The model in the paper, however, was extremely simple. After a second read, I starting seeing how a simple model could link together many different empirical patterns the paper showed (e.g., why insurgencies occur around a scaling factor of 2.5, why the distribution of Columbian and Afghanistan insurgencies have different tail shapes.)

Among the tricks I've learned from the paper, I liked the frequency distribution of real wars against random wars. 10,000 random wars were generated such that dates of the events were shuffled and the average frequency was taken across the samples. Authors then compare the frequency distribution of actual data with this randomly shuffled wars. I also liked the link between this work and the same patterns shown in financial data (how people herd around different stock items each day) -- it might really show a crucial link between violent and non-violent forms of human behavior.

Now moving forward, how could we utilize the result of this work to other social network data like Twitter? Insurgency, rallies, and demonstration -- all seem difficult in terms of parsing data. But if we can do anything, what would I want to look at? Adopting methodology is an obvious way, but could we do something cooler?

Jan 11, 2010

Decentralized Search Algorithms

Complex network research focuses on large-scale network structures such as social systems, cell biology, neurology, etc. The goal is to bserve real-world network properties and model the observed properties under random mechanisms.

Erdos-Renyi (1959) random graph G(n,p) connects two pairs of nodes with a probability p=c/n where c is a constant. It has two important properties:  (a) when c is less than 1, composed of small components all of which have O(logn)  (b) when c is greater than 1, a.a.s. (asymptotically almost surely), a unique giant component containing theta(n) nodes appears, which is called the phase transition.

Stanley Milgram's 6-degree separation states that not only there exist a short path in a society, it is possible for individuals to find such a path using only local information.

Small world networks by Watts and Strongatz uses grids (to guarantee clusters) and makes short cuts in the network uniformly random. This leads to preserving small-world properties (i.e., short distance and high clustering). The prevailing need for this model is that Erdos-Renyi random graph does not support the notion of clusters.

Extended model by Kleinberg is a variant of the small-world model. Rather than making shortcuts that are uniformly random, Kleinberg assumes shortcuts follow a particular distance distribution, rho(v,w)^-alpha, and showed that  (a) alpha=2, decentralized algorithm exists that is of O(log^2 n).  (b) Otherwise, there does not exist any poly-logarithmic decentralized algorithm.

Further readings:
1. J. Kleinberg. Navigation in a Small World. Nature 406(2000), 845.
2. J. Kleinberg. Complex Networks and Decentralized Search Algorithms. Proceedings of the International Congress of Mathematicians (ICM), 2006.

Winter School on Algorithms and Combinatorics

Attending a very interesting winter school organized by KAIST professors.  It's refreshing to brush up on theoretical concepts.

(1) Talk by Heekap Ahn (Postech) on Voronoi Diagram
Computational efficiency assumes that basic operations (+,-,*,assign,etc) take constant time O(1) and tries to count, given an input size n, how many basic operations is needed.

Convex hull is a set S of points in the plane that is the smallest convex set containing S: A naive algorithm will incur n^3 (n^2 to pick two points and see if it's at the edge) , but a smart algorithm will scan the network in clockwise direction (called the plane sweeping technique) and solve in nlogn (sort by x-axis) + 2n (clockwise step)

Line segment intersection has the worst case of n^2, but an output-sensitive algorithm (relative to the number of intersection) runs in (2n*2+2k)*logn = nlogn + klogn

(2) Talk by Kyomin Jung (KAIST) on NP-Completeness
Interesting observation about how our brains do arithmetics: We remember the addition of two numbers between 0-9, like a turing machine storing 10^2 combinations!, then extends this knowledge to do more complicated calculations.   Turing machine is a device with a finite amount of read-only "hard" memory (states) and an unbounded amount of read/write tape-memory.  변하지 않는 부분의 메모리 (하드웨어) + 연습장 (소프트웨어)

Notes on P and NP: (a) P is the set of problems that can be solved in polynomial time, given an input size n  (b) NP is the set of problems that can be *verified* in polynomial time, Given candidate solutions, one can verify whether a solution is right or wrong in P time  (c) P is a subset of NP   (d) The question of P=NP is related to the meaning of intelligence. (Let's say you've solved NP. If P=NP, then there might not be intelligence in the brain that solved it)  (e) Cryptography relies on P!=NP. Public keys are given out, but the combinations to generate a public key is practically hard to get.  (f) Unless we know P=NP, it is important to develop approximation algorithms!

Hamiltonian Path Problem (HP) - visit each node exactly once
Traveling salesman problem (TSP) - find the shortest path that is HP

If P (HP) reduces in polynomial time to Q (TSP), then P is no harder to solve than Q.   (a) NP-hard: if all problems all R\in NP are reducible to P, then P is NP-hard; NP중에서 가장 어려운 것  (b) NP-complete: if P is NP-hard and P\in NP, P is NP-complete (NP중에서 가장 쉬울 수 있다)   (c) Cook-Levin theorem: the first real case of NP-complete

Combinatorial optimization, max cut, underlying assumption is again P!=NP:  (a)  rho (ALGO, G) : given a graph G, how efficient is algorithm ALGO?   (b)  rho (ALGO) = min_G ( rho(ALGO, G) ): given any G instances, what is the best algorithm to solve a problem?  (c) rho (maxcut) = max_algo ( rho(algo) ): how hard is the problem itself?

PTAS (polynomial time approximation scheme): 1+/-e of the optimal

(3) Talk by Jinwoo Shin (MIT) on Metropolis-Hastings Rule
Suppose we only know local information, how could you sample users uniformly random? This problem is to find random walk without bias.  (a)  Uniformly random property is p_ij = p_ji   (b)  Adding a self loop can fix this easily, but could be prone to even-odd number oscillations.   (c)  A seminar work: Metropolis-Hastings rule since 1950s

(4) Talk by Sang-il Oum (KAIST) on Parameterized Complexity
Concepts: (a) Vertex cover : a set of vertices meeting all edges  (b) vertex cover problem (G, k): does G have a vertex cover of size <= k?  (c) Minor of a graph: a graph that could be obtained by contracting nodes or deleting nodes or edges

FTP (fixed parameter tractable): a problem with a parametr k is FPT if it can be answered in time O(f(k) n^c) for a computable function f and a fixed c. that is exponential on a fixed parameter k, but not in the input size n. Rod Downey & Michael Fellows

Kernelization: A decidable problem is FTP if and only if it has a kernel
문제가 input 사이즈와 상관없이 사이즈 k에 대해 변형될 수있을때

(5) Talk by Heekap Ahn (Postech) on Geometric Graphs
Notations:  (a) A graph without cycle is a forest; a connected forest is a tree.  (b) A separating set of G will make the graph disconnected.  (c) k-connected, if separating set has cardinality of at least k.  (d) A graph is planar if it can be drawn in the plane wihtout crossing edges.  That is, a planer graph has a crossing number of zero.  (e) Geometric graph is a subset of topological graph.

Main definition: The crossing number, denoted by cr(G), of a graph G is the least possible number of pairs of crossing edges in a graph of G.  (a) Finding the crossing number is NP-hard.  (b) There exist efficient algorithms to find crossing number of smaller than k; that is the problem is fixed parameter tractable.

(6) Talk by Kyomin Jung on Randomized Algorithms
Define a turing machine that can through a binary random coin, called a probabilistic turing machine.  (a) What is good about it? Smoothes the "worst case input distribution" into "randomness of algorithm." (b) Law of large number property means that independent random samplings lead to an average that is close to the expectation, e.g., Erdos-Renyi random graph G(n,p) (when does a giant component appears?), voting poll (how to find good sample for a poll?).

Chernoff Bound:  (a) Suppose we have a coin with a probability of a head p, then the expected number of heads is 1*p + 0*(1-p) = p.  (b) After flipping a coin m times, the error rate is <= exp ( - lamda^2*m
) where lambda is an error.  (c) Las Vegas algorithm is a randomized algorithm that always gives correct answer.  (d) Monte Carlo algorithm has deterministic running time, but its output could be correct with a certain probability.

BPP (bounded-error, probabilistic, polynomial time): the set of problems that have Monte Carolo algorithms.  (a) Is BPP=P (deterministic language)?  (b) Pseudo-random generator (PRG) picks a random number not purely random, but is very hard to determine its randomness in polynomial time.

Randomized min cut algorithm by David Karger: Repeat until |V|<=2, pick a random edge and contract the edge. The two remaining nodes represent the cut points.

Aug 28, 2009

Iran election in Twitter

One of the most exciting events in social media would definitely be how the use of Twitter lead to some of the rallies and protests in Iran. Our team at MPI-SWS started looking at Twitter to study the patterns of information propagation. After numerous days of data collection and parsing, finally we are ready to investigate how tens of millions of users communicated with each other. Here's a sneak peek of our on-going research: (Disclaimer: this is definitely an exciting, yet preliminary result and could be changed later on.)

We've looked at whether users who posted tweet(s) on Iran election are connected in the social graph. Imagine the entire social graph of Twitter. Then mark all nodes (=users) who wrote at least one tweet about Iran election. Remove all other nodes in the social graph. Now focus on the remaining nodes in the network. What do they look like?

The plot above shows the size and the number of connected components, where each connected component represents a set of users who are connected by friendship. There were 200,000 users who talked about Iran election in our (sampled) dataset. Surprisingly, 85% of the users belonged to a single large component and 2% of the users to smaller tons. 15% of the users were singletons; they were not connected to any other users who talked about Iran election.

We initially expected to see a power-law distribution whose characteristic pattern is a straight line. This means that the number of connected component should have x^a relationship with the size of the connected component x. (a is called the power-law exponent). But we see two different exponents in the plot.

So why don't we see a straight line in the size distribution of connected components? I have several hypotheses for why we might see a multi-scaling trend, such as the language barrier and the effect of mass media.---I like these moments when I encounter unusual patterns. This is what makes research all the more challenging and fun.

Focusing on the largest connected component, users in this group do show a power-law distribution in their connectivity. Some users potentially influenced tweets of more than 1,000 others (meaning that these users had more than 1,000 fans who also wrote about Iran election); likewise, a user can be influenced by more than 1,000 others in one's subscription list. Both indegree and outdegree distributions follow a power-law trend; but interestingly these quantities turn out to be not related (correlation coefficient of 0.3066).

There are a lot that need to be done and I'm fascinated to investigate how social media like Twitter have changed the way we encounter new information and collaboratively propagate messages among users.

Aug 12, 2009

The 1st International Workshop on Mining Social Media

If you're working on social networks and data mining, here's a perfect workshop to consider at a wonderful south of Spain:

Mining Social Media (MSM'09)
Paper submission deadline: September 6th
Venue: November 9th, 2009 Seville, Spain