derek ruths || network dynamics

Geoinference

Project Summary

Social media provides a continuously updated stream of data that has proven useful for analyzing various aspects of human behavior. Associating data with the geolocation from which it originated creates a powerful tool for modeling geographic phenomena, such as tracking the flu, predicting elections, or observing linguistic differences between groups. However, only a small amount of social media data comes with a location; for example, less than one percent of Twitter posts are associated with a geolocation. Geoinference attempts to infer the origin location for the majority of social media content that is unlocated. Despite significant interest in developing geoinference methods, little work has been done on comparing the methods or even on providing the community with usable open-source implementations of the algorithms. Thus, the Network Dynamics Geoinference project has three key goals:

  • To develop a rigorous comparison and evaluation of the current state of the art
  • To enable easy testing, development, and comparative evaluation of new methods
  • To create new geoinference techniques that readily scale and provide accurate inferences

The Network Dynamics Geoinference Library

Many state-of-the-art geoinference algorithms involve significant complexity. However, most have not been released as open source, which makes it difficult to compare them on the same data and to gain insights that could improve the techniques. Therefore, we have created the Network Dynamics Geoinference Library, which features nine recent algorithms implemented in a common framework and API. The algorithms are provided in Java or Python and can serve as reference implementations. Furthermore, we have released a set of comprehensive evaluation methods that make effective comparison possible across different approaches and datasets. For full details, see our paper in ICWSM 2015.

David Jurgens, Tyler Finethy, James McCorriston, Yi Tian Xu, and Derek Ruths. Geolocation Prediction in Twitter Using Social Networks: A Critical Analysis and Review of Current Practice. Proceedings of the 9th International AAAI Conference on Weblogs and Social Media (ICWSM). 2015
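
As a rough illustration of what comparing methods on the same data can involve, the Python sketch below scores a set of predicted user locations against known locations. It reports the median great-circle error and the fraction of users that received a prediction, two measures commonly reported in geolocation work. This is not the library's API: the function names, the dictionary-based input format, and the choice of measures are assumptions made for the example, and the paper describes the evaluation methods actually released.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def evaluate(predicted, actual):
    """Score predictions against ground truth (hypothetical helper).

    predicted, actual: dicts mapping a user id to a (lat, lon) tuple.
    Users missing from `predicted` lower coverage but add no error value.
    """
    errors = [haversine_km(predicted[u], actual[u])
              for u in actual if u in predicted]
    return {
        "median_error_km": median(errors) if errors else None,
        "coverage": len(errors) / len(actual) if actual else 0.0,
    }


if __name__ == "__main__":
    gold = {"a": (45.50, -73.57), "b": (43.65, -79.38), "c": (49.28, -123.12)}
    guess = {"a": (45.50, -73.57), "b": (45.42, -75.70)}
    print(evaluate(guess, gold))
```

Reporting coverage alongside error matters because a method can look accurate simply by declining to predict for difficult users.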

Geoinference FREESR

A key limitation for geoinference in Twitter is the need for large quantities of Twitter data, which many researchers cannot access and which, under Twitter’s Terms of Service, cannot be shared between researchers. Furthermore, the scale of most Twitter datasets used for geoinference makes them infeasible to replicate through Twitter’s API, which effectively prevents testing new methods on the datasets used in prior work. To overcome this limitation, we have proposed a new framework for evaluation in which experimenters send their computational models to the data, where the models are evaluated remotely without requiring direct local access to the data itself. We refer to this model as FREESR (Framework for Reproducible Evaluation of Experiments with Sensitive Resources). By hosting a geoinference task in a FREESR instance, researchers can test their models on the same data as others without needing access to the data itself.
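
The idea of moving the model to the data can be pictured with a small Python sketch. It is only a conceptual illustration, not the FREESR interface: the `RemoteEvaluator` and `KeywordModel` classes, the `predict` method, the city-label locations, and the accuracy measure are all hypothetical. The point it shows is that the data host runs the submitted model itself and returns only aggregate scores, so the experimenter never sees the underlying posts.

```python
class RemoteEvaluator:
    """Hypothetical stand-in for a data host that keeps its dataset private."""

    def __init__(self, private_posts, private_locations):
        # Both dicts are keyed by user id and never leave this object.
        self._posts = private_posts
        self._gold = private_locations

    def evaluate(self, model):
        """Run a submitted model on the private data; return aggregates only."""
        if not self._posts:
            return {"users": 0, "accuracy": 0.0}
        correct = sum(
            1 for user, posts in self._posts.items()
            if model.predict(posts) == self._gold[user]
        )
        return {"users": len(self._posts),
                "accuracy": correct / len(self._posts)}


class KeywordModel:
    """Toy experimenter-side model: guesses a city from keywords in posts."""

    def __init__(self, keyword_to_city, default):
        self._keyword_to_city = keyword_to_city
        self._default = default

    def predict(self, posts):
        for post in posts:
            for word in post.lower().split():
                if word in self._keyword_to_city:
                    return self._keyword_to_city[word]
        return self._default


if __name__ == "__main__":
    # The host's side: data the experimenter cannot download.
    host = RemoteEvaluator(
        {"u1": ["Go Habs go"], "u2": ["Nice weather today"]},
        {"u1": "Montreal", "u2": "Toronto"},
    )
    # The experimenter's side: only the model is submitted.
    model = KeywordModel({"habs": "Montreal"}, default="Montreal")
    print(host.evaluate(model))
```

Because every submitted model is scored against the same hidden data, results from different researchers remain directly comparable.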

We have released a publicly usable FREESR instance for the datasets used in the cross-validation experiments of our ICWSM 2015 paper on geoinference. The following resources are available for the community’s benefit: