derek ruths || network dynamics

Facebook Messenger Analytics – Part 1: Data Collection

By Jamie McCorriston & Morgane Ciot

Facebook offers an easily accessible API that allows data analysts and application developers to obtain the public content of a user’s profile. While publicly viewable content offers an interesting dataset for social analytics, private messenger data contains a rawer version of a user’s online verbal footprint. Due to privacy constraints, Facebook does not allow external access to messaging data, but users can at least retrieve their own sent and received messages. Retrieving your own private messenger data is not a trivial exercise, but with knowledge of a couple of HTTP requests and the aid of some basic Python packages, it can be done fairly quickly. As active Facebook Messenger users, we decided to take a stab at some fairly basic analytics using data from our chats with all of our friends. This blog post details our experience in collecting and analyzing Facebook Messenger data and is divided into two parts.

  1. In Part 1, for the more technically-oriented audience: how to collect our own Facebook Messenger data.
  2. In Part 2, for everyone: a series of experiments performed on our own Facebook Messenger data.

Part 1: Collecting Our Own Facebook Messenger Data

Getting Your Facebook Messenger Data

Facebook’s developer toolkit doesn’t let us access our own messaging data. It is mainly composed of:

  • the Ads API to Facebook’s marketing platform, which lets us create and manage ads, run analytics, and generally monetize an app
  • the Graph API, which lets us read and write public content (spanning games, events, groups, wall posts and comments, likes, shares, pages, photos, and videos)
  • the Facebook SDK, which allows us to do things like log in to an app using Facebook’s login system, share content by embedding a “like” button into an app, and access the Graph API (platform-specific SDKs provide more advanced functionality, like the Unity gaming SDK)

Facebook’s Chat API, now deprecated, was designed to allow websites and apps to use the messaging tool, rather than acquire any messaging data. That leaves us with having to scrape all of our data.

How To Scrape Our Own Facebook Data

If there’s a way to access data in a browser, there’s a way to get it programmatically. For our web scraping, we used Python’s requests, BeautifulSoup4 and urllib modules. They can all be installed using pip. Using BeautifulSoup alone to scrape the HTML from our Facebook messages page is not a feasible option, since only a few recent messages are stored directly in the HTML. The rest are loaded using AJAX POST requests that return a JSON object containing a certain number of previous messages. The goal is thus to make this POST request ourselves so that we can programmatically get our complete message history.

Logging In

To start, we need to log into Facebook. This can be done by using a requests session. First, we navigate to the Facebook login page, parse the HTML, and find the hidden “lsd” input element, whose value has to be sent along with the username and password.

import requests
from bs4 import BeautifulSoup as bs

# Starts the session
session = requests.Session()
# Opens the Facebook login page in the session
homepage_response = session.get('https://www.facebook.com/').text
# Loads the login page into a BeautifulSoup soup object
# (the "html5lib" parser is a separate package: pip install html5lib)
soup = bs(homepage_response, "html5lib")
# Extracts the LSD token from Facebook login page, required for login post request
lsd = str(soup.find_all('input', attrs={"name": "lsd"})[0]['value'])
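If html5lib or BeautifulSoup is not available, the same hidden-field lookup can be sketched with the standard library’s html.parser instead. This is a swapped-in technique, not what we used above; the class and function names are our own, and the sample markup is made up to stand in for the real login page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />
        if tag != "input":
            return
        attr_map = dict(attrs)
        if attr_map.get("type") == "hidden" and attr_map.get("name"):
            self.fields[attr_map["name"]] = attr_map.get("value") or ""


def hidden_fields(html):
    """Returns a dict of every hidden input found in the HTML string."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Made-up markup standing in for the real login page
sample_html = (
    '<form id="login_form">'
    '<input type="hidden" name="lsd" value="AVabc123"/>'
    '<input type="text" name="email"/>'
    '</form>'
)
fields = hidden_fields(sample_html)
lsd = fields["lsd"]
```

Collecting every hidden field, rather than only lsd, makes it easy to see what else the form carries if the login request starts failing.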

Next, we create and send a post request using the login data. This will create a session where we’re logged in and can subsequently access any other Facebook page using our account.

# Login data for the Login POST request
login_data = {
    'email': username,
    'pass': password,
    'locale': 'en_US',
    'non_com_login': '',
    'lsd': lsd
}
# URL for the login POST request
login_url = 'https://login.facebook.com/login.php?login_attempt=1'
# Logs in and stores the response page (our Facebook home feed) as text.
# Note: verify=False turns off TLS certificate checking; drop it unless needed.
login_response = session.post(login_url, data=login_data, verify=False).text

Finding the POST Request for Retrieving Message History

To determine the POST request required to load our message history, we use a debugging tool in our web browser. We go to our messages on Facebook and open the browser’s debugger (such as Firefox’s Firebug or Chrome’s Dev Tools) by right-clicking on the page and selecting “Inspect Element.” This opens a panel that lets us see the page’s HTML. Clicking on the tab that lists network calls (e.g. the “Network” or “Console” tab) shows the HTTP requests being made while we are active on the page. By scrolling up and loading previous messages in a conversation with a friend, we can identify the calls that are being made to Facebook.

Figure 1: A screenshot of the POST request for getting message history displayed in the browser’s debugger tool.

Figure 2: A more detailed list of parameters included with the message history POST request.

The POST Request Parameters

The anatomy of the POST request looks like this:

message_history_url = 'https://www.facebook.com/ajax/mercury/thread_info.php?&messages[user_ids][%d][limit]=%d&messages[user_ids][%d][offset]=%d&client=web_messenger&__user=%d&__a=1&fb_dtsg=%s' % (friend_id, json_limit, friend_id, offset, my_id, fb_dtsg)

We will need the following parameters:

  • our own ID
  • our friend’s ID
  • a limit specifying the maximum number of messages that are stored in one JSON response
  • the ‘offset’, or the point from which we would like to start collecting message history (0 means the most recent messages will be collected, 1000 means that the 1000 most recent messages will be skipped)
  • Facebook’s DTSG security token
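As a hedged alternative to the string formatting above, the same URL can be assembled with urllib.parse.urlencode, so that any characters in the token that are special in URLs get escaped. The function name and the default values are our own choices for illustration; the endpoint and parameter names are the ones shown above. The brackets in the parameter names come out percent-encoded, which is equivalent once decoded.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

THREAD_INFO_URL = "https://www.facebook.com/ajax/mercury/thread_info.php"


def build_history_url(friend_id, my_id, fb_dtsg, limit=5000, offset=0):
    """Builds the message history URL from the five parameters listed above."""
    params = [
        ("messages[user_ids][%d][limit]" % friend_id, limit),
        ("messages[user_ids][%d][offset]" % friend_id, offset),
        ("client", "web_messenger"),
        ("__user", my_id),
        ("__a", 1),
        ("fb_dtsg", fb_dtsg),
    ]
    return THREAD_INFO_URL + "?" + urlencode(params)


# Placeholder IDs and token, not real values
url = build_history_url(friend_id=1234, my_id=5678, fb_dtsg="token:with/odd+chars")
# Decoding the query string again shows the parameters round-trip intact
query = parse_qs(urlsplit(url).query)
```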

User IDs

Our ID and our friend’s ID can be found by looking in the HTML. We can right-click on our list of messages and select “Inspect Element” to view the element’s HTML. Under the div labelled “List of message threads”, we should be able to see a list of our conversations, with our friends’ user IDs appearing in the list items’ id field. As for our own user ID, we can find it by going into any conversation and inspecting one of the messages we’ve sent. The href associated with our name should contain our user ID.

Figure 3: A list of our friends’ IDs embedded in the HTML.
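Reading the IDs by hand gets tedious with many conversations, so here is a rough sketch of pulling them out of saved HTML with the standard library. The exact format of the id attribute is an assumption on our part: the sketch simply takes the trailing digits of each list item’s id, and the sample markup is invented.

```python
import re
from html.parser import HTMLParser


class ListItemIdCollector(HTMLParser):
    """Collects the trailing number of every <li id="..."> attribute."""

    def __init__(self):
        super().__init__()
        self.user_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        element_id = dict(attrs).get("id") or ""
        # Assumption: the user ID is the run of digits at the end of the id
        match = re.search(r"(\d+)$", element_id)
        if match:
            self.user_ids.append(int(match.group(1)))


def friend_ids_from_html(html):
    """Returns the numeric IDs found in list item id attributes, in page order."""
    collector = ListItemIdCollector()
    collector.feed(html)
    return collector.user_ids


# Invented markup; the real thread list will look different
sample_html = (
    '<ul><li id="thread_1001">Alice</li>'
    '<li id="thread_1002">Bob</li>'
    '<li>no id here</li></ul>'
)
friend_ids = friend_ids_from_html(sample_html)
```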

Limit

5000 seems to be a safe value; the request sometimes breaks with a JSON limit of over 5000. However, by manipulating the offset value, complete message histories of longer threads can be collected iteratively.
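The iteration over offsets can be sketched as a small loop. Here fetch_page is a hypothetical callable that performs one request for a given offset and limit and returns the messages from that response as a list; the fake fetcher below only stands in for the real request so the loop can be run on its own.

```python
def collect_history(fetch_page, limit=5000):
    """Calls fetch_page(offset, limit) repeatedly until a short page comes back."""
    messages = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        messages.extend(page)
        # A page shorter than the limit means we have reached the end of the thread
        if len(page) < limit:
            break
        offset += limit
    return messages


# Stand-in for the real request: pretends a thread holds 12 messages
fake_thread = list(range(12))


def fake_fetch(offset, limit):
    return fake_thread[offset:offset + limit]


history = collect_history(fake_fetch, limit=5)
```

Since an offset of 0 starts at the most recent messages, the pages arrive newest first, so the combined list may need reordering before analysis.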

DTSG Token

Most websites with login scripts that require authentication use a security token. Facebook’s DTSG security token can be found in the HTML. We can search (Ctrl+F) for “fb_dtsg” in the page’s HTML to find the element containing the token (in the “value” field).

Figure 4: Facebook DTSG security token embedded in Facebook’s HTML.

Alternatively, extract it using a regex:

import re

# Regex for extracting the fb_dtsg token from our facebook home page (after login)
FB_DTSG_REGEX = r'<input type="hidden" name="fb_dtsg" value="(.*)" autocomplete="off"/>'

# Extracts the fb_dtsg token from the HTML, required for the message history POST request
# (the greedy (.*) can overshoot, so we keep only the text before the first quote)
fb_dtsg = re.search(FB_DTSG_REGEX, login_response).group(1).split('"')[0]
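A slightly more forgiving variant, offered as a sketch rather than the pattern we actually ran: matching [^"]* instead of .* stops at the closing quote, so no split is needed, and the helper returns None instead of raising when the token is missing. The function name and the sample markup are our own.

```python
import re

# Matches only the name/value pair, so other attributes on the tag do not matter
FB_DTSG_PATTERN = re.compile(r'name="fb_dtsg"\s+value="([^"]*)"')


def extract_fb_dtsg(html):
    """Returns the fb_dtsg token from the HTML, or None if it is not present."""
    match = FB_DTSG_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up markup with a placeholder token
sample_html = (
    '<input type="hidden" name="fb_dtsg" value="AQxyz:123" autocomplete="off"/>'
    '<input type="hidden" name="other" value="1"/>'
)
token = extract_fb_dtsg(sample_html)
missing = extract_fb_dtsg("<html></html>")
```

This still assumes that value directly follows name in the tag, as it does in the pattern above.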

Figuring Out the JSON

The JSON object returned by the POST request is very clean and stores more than just message content. Here’s what it looks like:

Figure 5: An example Facebook message, in the JSON format returned by the POST request used to get it.
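Turning the response text into Python objects might look like the sketch below. Several things here are assumptions rather than facts taken from the figure: that the response body starts with a “for (;;);” guard that has to be stripped before parsing, that the messages sit in a list under payload and actions, and that each message has author, body and timestamp fields. Check these against your own response before relying on them.

```python
import json

# Assumed guard prefix in front of the JSON body
JSON_GUARD = "for (;;);"


def parse_history_response(raw_text):
    """Returns the list of message dicts in a response, or [] if there are none."""
    if raw_text.startswith(JSON_GUARD):
        raw_text = raw_text[len(JSON_GUARD):]
    data = json.loads(raw_text)
    # Assumed layout: {"payload": {"actions": [ ...messages... ]}}
    payload = data.get("payload") or {}
    return payload.get("actions") or []


# Invented response text with one message
sample_response = (
    'for (;;);{"payload": {"actions": ['
    '{"author": "fbid:1234", "body": "hello", "timestamp": 1400000000000}'
    ']}}'
)
messages = parse_history_response(sample_response)
bodies = [message.get("body", "") for message in messages]
```

Returning an empty list for a missing or empty payload keeps a calling loop from crashing on threads that yield nothing.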

Here’s the fully commented code for downloading all of our messages.

Alternative Solution: Browser Loading

For those who are into web development or have used a web testing framework before, there is an alternative way to load our messages. Using a module like Selenium, we can dynamically load content that is fetched by JavaScript functions. This allows us to automatically scroll through infinite scrolling panels to access content that isn’t embedded in the HTML we get immediately upon loading the page. We’ll need a scripted browser that allows us to automate web page interaction. Here, we use PhantomJS (brew install phantomjs works best on Macs) and the Splinter and Selenium Python libraries (pip install splinter installs selenium as well). To log in, we create a browser instance, find the login form, and fill it with our username and password.

import selenium
from splinter import Browser
browser = Browser("phantomjs")
browser.visit("https://www.facebook.com/")
inputs = browser.find_by_id("login_form")[0].find_by_tag("input")
inputs[1].fill(username)
inputs[2].fill(password)

Now find the enter button and click it:

enter = browser.find_by_id("u_0_n").first
enter.click()
print("Logged in!")

We used this method to gather all of our friends’ user IDs and store them in a text file, though we can also use the AJAX calls method described above to do the same thing. View the full commented code here.


4 Responses to Facebook Messenger Analytics – Part 1: Data Collection

  1. Ousama July 11, 2016 at 10:57 am #

    Hi I’ve sent message to a friend of mine by mistake and on face book chat and would really want to delet it and couldn’t find way how to delet .could u help me .he is not abot now to access his message bcz he is out of office for longer than 1 month kindly reply me positive and immediate reply tnx

  2. Oliver November 8, 2016 at 8:54 pm #

    Facebook messenger call logs ( dates, times, duration, who calls, rating) as well as stickers and gifs are not included in the official personal archive download.

    Can you help me how to get this?
    Mainly I just need the call data mentioned between me and one other person. I have searched far and wide and I am not a developer (so far).
    Can you help me scrape facebook Messenger?

    I know its easy with skype. But facebook Messenger seems like a legal vermin to escape any liability.

  3. KD December 31, 2016 at 4:12 am #

    Love this! Excited for Part 2…

  4. Rafsun June 18, 2017 at 4:54 pm #

    Hey! For some accounts, thread_info.php returns empty payload as response. But it works for other accounts. Can you explain why?
