
Parsing Reddit – PRAW to the Rescue

If you happen to be one of the millions of users afflicted with Reddit addiction who has managed to shake off their finger-cramping, eye-twitching obsession for long enough to allow your pupils to re-adjust to the three-dimensional world, you may have realized Reddit’s potential as a multi-faceted resource. Reddit is a website where users post content that they find interesting to share with the community. Its collection of subreddits, Reddit’s way of organizing content into specific topics, is an aggregation of its users’ interests and ideas. For a personal user, it offers a vast network of interest-oriented communities capable of satisfying the spectrum of a user’s curiosity. For a researcher, the platform opens up an endless array of exciting questions about how this enormous community behaves and organizes itself.

Reddit has registered more than 150 million unique visitors and 7 billion pageviews in the past month alone, from across 214 different countries. This multinational, multiethnic, multicultural melting pot has given rise to a diverse community that provides incredible opportunities for research. Imagine the potential of performing a detailed demographic analysis over a large population that varies across age, gender, race, and political inclination. Consider the illuminating data that could be gleaned from studying the growth of a community, the rise and fall of thousands of interconnected groups. With programmatic access to Reddit, all the necessary content is at our fingertips; we only lack the proper tools to manipulate it.

This is where PRAW enters the scene. PRAW (the Python Reddit API Wrapper) is an easy-to-use Python package that can extract Reddit data through the official API. The wrapper allows for three levels of information extraction:

1) Subreddit

2) Submission

3) Comments

Before we delve into that, we first need to set up PRAW.

If this is your first time using PRAW, you will have to install it before you begin. The recommended way to install it is via pip:

$ pip install praw

(If you do not have pip installed, see the pip documentation for instructions on how to set it up.)

Next, we import PRAW and initialize our Reddit object.

>>> import praw
>>> reddit = praw.Reddit(useragent)

We start by importing PRAW and connecting using a useragent string. The useragent uniquely identifies our script to Reddit’s servers. More information about useragent strings can be found in Reddit’s API documentation.
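Reddit asks that scripts use a unique, descriptive useragent. The exact wording is up to you; a common convention (the app name and username below are placeholders) looks something like this:

>>> useragent = 'python:reddit-data-collector:v0.1 (by /u/yourusername)'
>>> reddit = praw.Reddit(useragent)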

Now let’s get right down to it: how do we extract interesting data? We will address each level individually.

SUBREDDIT

Subreddits organize Reddit into its many topic categories, and PRAW provides API calls for retrieving them based on the parameters that we set. If we are looking for the most popular subreddits, we can use the get_popular_subreddits(limit) command, which returns the most active subreddits up to the limit that we set. If we are looking for the newest subreddits, then get_new_subreddits(limit) is the best choice. Other options are get_random_subreddit(), which returns a random subreddit, and get_my_subreddits(), which returns the subreddits the session’s user is subscribed to. If you happen to have a Reddit account, you can use that account as the session user to perform actions on its behalf or to gather information about it. PRAW allows you to log in and specify the session’s user with the command login(username, password), as in the sketch below.
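Here is a minimal sketch of those calls, using the reddit object from above (the username and password are placeholders; login() requires a real account, and these are the PRAW 3-era method names used throughout this post):

>>> newSubreddits = reddit.get_new_subreddits(limit = 5)
>>> randomSubreddit = reddit.get_random_subreddit()
>>> reddit.login('yourusername', 'yourpassword')
>>> mySubreddits = reddit.get_my_subreddits()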

PRAW returns a list of subreddit objects that match our query. For example, using the PRAW object we created earlier:

>>> subreddits = reddit.get_popular_subreddits(limit = 2)
>>> for subreddit in subreddits:
…    print subreddit
funny
AdviceAnimals

SUBMISSION

After we select the subreddits, suppose we are interested in assessing what people are submitting to them. Again, we have options for deciding how we want to select submissions.

There are a few different ways to collect submissions for a particular subreddit. First, we have to get the subreddit object for the subreddit of interest. We do this using the command subredditObject = reddit.get_subreddit(subredditName) (reddit is the PRAW object we created earlier). With our subredditObject, we have several commands available to collect the submissions. Looking at the webpage for a subreddit, we can see that submissions can be ordered by a few different criteria. These include ‘hot’, the most recently active submissions; ‘new’, the newest submissions; ‘top’, the submissions with the highest vote tallies; and several others. To collect submissions based on one of these criteria, we simply use the command get_<criterion>(limit).

For example, get_hot(limit = 5) would return submission objects, with all their associated attributes, for the five most active submissions. These attributes include all the available information about each submission: when it was posted, who posted it, its number of up and down votes, and so on.

Here’s a full example, using the subreddit object we created earlier:

>>> submissionObjects = subredditObject.get_hot(limit = 1)
>>> for submissionObject in submissionObjects:
…    print submissionObject.num_comments
…    print submissionObject.ups
…    print submissionObject.downs
5
9
2
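The other criteria follow the same pattern. For instance, assuming get_top and get_new behave like get_hot (a sketch following the get_<criterion> convention described above):

>>> topSubmissions = subredditObject.get_top(limit = 10)
>>> newSubmissions = subredditObject.get_new(limit = 10)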

The associated attributes can be very informative. There is a long list of attributes for each submissionObject, so let’s spend some time looking at some of the more useful ones available to us.

  • submissionObject.subreddit : Subreddit to which the submission belongs.
  • submissionObject.created : Timestamp of when the submission was created.
  • submissionObject.ups : Total upvotes for the submission; each user can contribute at most one upvote per submission.
  • submissionObject.num_comments : Total number of comments on this particular submission (any user may comment).
  • submissionObject.selftext : The text portion of the submission. This is often empty if the submission is a link to content on another website.
  • submissionObject.downs : Total number of downvotes for the submission. Can be used in conjunction with the submission’s karma to calculate the number of upvotes.
  • submissionObject.link_flair_text : The flair text attached to the submission; flair is often used in sports subreddits to show which team a poster supports, or to label posts by category.
  • submissionObject.title : Title of the submission.
  • submissionObject.author : The creator of the submission.

There are many other attributes available for each submission. If we want to see all of them, we can use the command vars(submissionObject), which returns a dictionary of all the attributes.
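One attribute worth a closer look is created, which is a Unix timestamp; here is a quick sketch for turning it into a readable date using only the standard library:

>>> from datetime import datetime
>>> print datetime.fromtimestamp(submissionObject.created)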

COMMENTS

Delving a little deeper, we can access the comments of each submission object separately.

Included in the submission object we have collected is the forest of comments associated with it. Top-level comments are treated as the tree roots, with lower-level comments branching off from their parents. You can access these comments using submissionObject.comments; PRAW returns a list of comment objects, each with its associated attributes. Depending on your purpose for collecting the information, it is then easy to filter comments using the data in those attributes.

For example:

>>> for commentObject in submissionObject.comments:
…    print commentObject.ups
…    print commentObject.downs
3
2
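The loop above only touches the top-level comments. To walk the whole forest, we can recurse through each comment’s replies attribute. A minimal sketch, assuming the PRAW 3-era praw.objects module (large threads may contain unexpanded MoreComments placeholders, which carry no body and are skipped here):

>>> from praw.objects import MoreComments
>>> def walk_comments(comments, depth = 0):
…    for comment in comments:
…        if isinstance(comment, MoreComments):
…            continue  # unexpanded 'load more comments' stub
…        print '  ' * depth + comment.body[:40]
…        walk_comments(comment.replies, depth + 1)
…
>>> walk_comments(submissionObject.comments)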

We can now look at some of the richest data we can collect from Reddit. Understanding what these attributes tell us is essential for getting the full potential out of what we have collected. Here is some more information about a few of the most useful attributes of the comment object.

  • commentObject._submission : Submission to which the comment belongs.
  • commentObject.created : Timestamp of when the comment was created.
  • commentObject.body : Content of the comment.
  • commentObject.score : Net of the upvotes and downvotes for the comment; upvotes count +1, downvotes count -1.
  • commentObject.ups : Total number of upvotes for the comment.
  • commentObject.downs : Total number of downvotes for the comment.
  • commentObject.gilded : Number of times users have gifted gold to the comment, generally for exceptionally high-quality contributions. Gold gives the recipient special privileges on the website and is one of the main ways Reddit makes money.
  • commentObject.edited : True or False based on whether the comment has been edited.

To access all the available attributes, use the command vars(commentObject), which returns a dictionary containing all the attributes.
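As a quick illustration of filtering on these attributes (a sketch; the score threshold of 10 is arbitrary):

>>> for commentObject in submissionObject.comments:
…    if commentObject.score > 10 and not commentObject.edited:
…        print commentObject.body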

The following example shows how we can put everything together to collect the Reddit content that we desire:

>>> import praw
>>> reddit = praw.Reddit("/u/_Tuna_")
>>> subreddit = reddit.get_subreddit('news')
>>> submissions = subreddit.get_hot(limit = 5)
>>> for submission in submissions:
…    print submission.domain
…    f = open(submission.domain + '.txt', 'a')
…    for comment in submission.comments:
…        f.write(comment.body.encode('utf-8'))  # comment text is unicode
…    f.close()
nj.com
nytimes.com
nymag.com
newson6.com
theneworleansadvocate.com

In this example, we first initialize the Reddit object, ‘reddit’, with our useragent “/u/_Tuna_”. Next we collect the news subreddit object using the get_subreddit(‘news’) command, which provides access to the submissions and comments in the news subreddit. We then collect five submission objects using the get_hot(limit = 5) command, which provides the top five hits from the subreddit’s ‘hot’ filter.

Now for the good stuff. The first for-loop allows us to iterate through each of the top five ‘hot’ submissions and collect every user’s comment. The embedded for-loop writes those comments into a .txt file named after the domain attribute of the submission we are collecting from (e.g. nj.com.txt). Pretty neat, eh?

From there, it’s up to you to decide how to analyze the data you’ve just collected. Perhaps you want to determine people’s opinions on the news source, articles, or the subject that is being reported. Perhaps you just want to take a minute to bask in the power of your sweet new skills. It’s OK. We’ve all been there.

By printing the domain attribute of each submission object, we can see which websites users linked to for their submissions; this information could be very useful in determining the type of sources Reddit users get their news from. We then walk the comments attribute of each submission object, collect the text of each comment from its body attribute, and write it all to a file for later analysis.
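To take the domain analysis one step further, a few lines with the standard library’s Counter will tally where submissions come from (a sketch; raise the limit to taste):

>>> from collections import Counter
>>> domains = Counter(s.domain for s in subreddit.get_hot(limit = 100))
>>> print domains.most_common(5)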
You now have all the tools you need to start collecting data from Reddit using PRAW. To learn more about PRAW and all the amazing things it can do, visit the PRAW website.
