Reddit Data Collection

by Tyriq Zvijer

Reddit is a forum-based community site that was founded on June 23, 2005. Each forum, called a subreddit, has the “r/” prefix before its topic name. Users can join subreddits on a wide range of topics that interest them. Useful or well-received content can receive awards in the form of platinum, gold, or silver “coins.”

Data Collection

I collected posts from Reddit by scraping the platform. I got access to Reddit’s developer site, which allowed me to start using their API. After some testing, I found that the Reddit API had a strict limit of 100 posts per query, which did not yield enough data in my initial tests. I would have had to run multiple queries and keep track of the posts I had already found. After some research, I found an alternative in PRAW (Python Reddit API Wrapper), a Python wrapper that helped me get around this limitation.

I set up a list of keywords to search for (“covid”, “covid-19”, “coronavirus”, and “covid19”), as well as a list of subreddits to search through (“AskReddit”, “Coronavirus”, “collapse”, “politics”, “conspiracy”, “news”, “Conservative”, and “democrats”). I used nested for loops to search for every keyword within each subreddit and limited the number of posts returned for each subreddit to 50. I chose this number because of how long the program took to run: with a limit of 100 posts, the search had still not finished after 2 full days, while a limit of 50 posts reduced the run time to a little over 16 hours.

In total, the program collected 4,606 comments across those 8 subreddits. I consulted Professor Rich Thompson to confirm that the amount of data returned would be sufficient for our analysis. After the data was cleaned and processed, it was graphed for visualization.
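The collection loop described above can be sketched roughly as follows. This is a minimal illustration, not the original program: the function name collect_posts, the fields stored for each post, and the choice to apply the limit of 50 to each keyword search are assumptions. The PRAW calls it relies on are reddit.subreddit(...) and subreddit.search(query, limit=...).

```python
KEYWORDS = ["covid", "covid-19", "coronavirus", "covid19"]
SUBREDDITS = ["AskReddit", "Coronavirus", "collapse", "politics",
              "conspiracy", "news", "Conservative", "democrats"]


def collect_posts(reddit, subreddits=SUBREDDITS, keywords=KEYWORDS, limit=50):
    """Search every keyword in every subreddit, keeping each post only once.

    `reddit` is expected to behave like a praw.Reddit instance.
    Applying `limit` to each individual search is an assumption.
    """
    seen_ids = set()  # posts already collected, so repeats are skipped
    rows = []
    for sub_name in subreddits:
        subreddit = reddit.subreddit(sub_name)
        for keyword in keywords:
            for post in subreddit.search(keyword, limit=limit):
                if post.id in seen_ids:
                    continue  # the same post can match several keywords
                seen_ids.add(post.id)
                rows.append({
                    "id": post.id,
                    "subreddit": sub_name,
                    "keyword": keyword,
                    "title": post.title,
                    "created_utc": post.created_utc,
                })
    return rows


def main():
    # Requires the third-party praw package and real API credentials.
    # The credential strings below are placeholders, not working values.
    import praw

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="covid-sentiment-study (hypothetical user agent)",
    )
    rows = collect_posts(reddit)
    print(f"Collected {len(rows)} unique posts")
```

Calling main() with valid credentials would run the full search. Tracking the post IDs in a set is one simple way to account for posts that match more than one keyword.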

Data Visualization

All graphs were produced in R.

The near-identical point plots show that there were almost equal numbers of negative and positive posts at any given time during the period examined. We thought this could be because so much news about the pandemic was being published daily by various sources with differing opinions. Also, because the posts were scraped indiscriminately, regardless of their popularity on the site, we get an unfiltered view of the posts rather than one limited to the top posts from this period.

The compound score combines the sentiment of the words in a post into a single value that is normalized between -1 and +1. The closer a score is to +1, the more positive the post. This graph reinforces the previous point that sentiment towards COVID-19 fluctuated widely throughout the pandemic. However, the graph also reveals an overall positive trend in sentiment.
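To make the normalization concrete, here is a small sketch. The text does not name the sentiment tool; the term "compound score" matches the VADER sentiment analyzer, so the formula and the constant 15 below follow VADER's reference implementation, and that match is an assumption. The cutoff of 0.05 for labeling a post is VADER's commonly recommended threshold, not a value taken from this analysis.

```python
import math

ALPHA = 15  # normalization constant from VADER's reference implementation


def normalize(valence_sum, alpha=ALPHA):
    """Squash a raw sum of word valences into the range [-1, +1]."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    return max(-1.0, min(1.0, score))


def label(compound, threshold=0.05):
    """Turn a compound score into a label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

For example, a raw valence sum of 1.0 normalizes to 0.25, which this sketch would label as positive.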

On Reddit, we see a somewhat even split between positive and negative posts, with neutral posts in the minority. The posts in the observed subreddits are often just the title of the news article being linked. News outlets tailor these titles to make readers feel a certain emotion before they read the article, so it is no surprise that the posts are mostly negative or positive: posts like these garner more attention than neutral ones.

With these graphs, we can see the main emotion behind each of the posts that were gathered. As stated before, because most of these posts share their title with the article being posted, we can see the sentiments users shared with news outlets. Trust and fear were the most common positive and negative emotions, respectively, which was to be expected given that the posts mostly consist of article titles. Many headlines ask readers to trust in policies and other aspects of the pandemic response, while others sow fear about policies and rising death tolls. Anticipation was another widely posted emotion. Many of these posts probably related to lockdowns being lifted and vaccinations becoming more readily available.

Word Cloud

This word cloud was created by Kalyssa Harris.