For all the wonderful things we hear about how compute clusters enable the analysis of massive datasets, the sad truth is that few researchers can use them to analyze large network datasets. This is due to both practical and theoretical issues.
From a practical perspective, clusters are expensive, their administration requires non-trivial time and technical knowledge, and the tools for using them aren’t user-friendly. From a theoretical perspective, algorithms that run on network data are notoriously difficult to parallelize. As a result, few common network analysis algorithms have good parallelized versions or are implemented in standard tools.
Despite all this, the average size of network-based datasets is growing. This means that more and more researchers need to analyze large networks.
Earlier today at the International Conference on Weblogs and Social Media, I ran a tutorial that covered a range of easy techniques for making analysis of massive network datasets possible on standard-issue desktop and laptop computers.
The presentation slides as well as demo scripts are all online and publicly available, so please feel free to take a look.
One participant suggested that I start blogging about these and other techniques for large-scale data analysis, which sounded like a great idea to me. If you’re interested, please visit this space in the coming days and weeks, as I’ll start making regular posts about these and other issues in social data analysis. And if you have questions, or topics you’d like me to discuss, don’t hesitate to email me.