Summary: Performed natural language processing using the SciKit-Learn Python library to analyze a CSV containing Reddit posts and predict how many upvotes a post would get.
Provided with a collection of posts from reddit.com from numerous programming-focused subreddits, I was tasked to come up with a target to predict, and make models to do so.
Upon conclusion of EDA and cleaning the data I decided that my goal would be to do a regression model predicting the amount of upvotes a post would get. The reason I chose this is that being able to predict if a post will get lots of upvotes (which is directly related to how visible it will become to more people) can be a very valuable thing. For example, some companies can have large portions of their community based in a subreddit, if you can predict what becomes popular it could let a Community Manager prioritize where they interact to get ahead of things, or aid in advertsing efforts. Despite none of the columns looking very predictive of upvotes, I didn’t want that to stop me from working to predict something valuable, and I thought it would be interesting to use mainly natural language processing to see if you can predict how popular a post will get.
To that end I assumed I was looking at a post at the begginning of it’s lifetime, and therefore removed information from the dataset that you would have into the posts lifetime, such as number of comments, and being gilded. The main challenge faced was that there were a few huge outliers for score, and then the vast moajority of posts only have a couple of upvotes.
At the conclusion of the project I had working model that did better than baseline, but there’s definitely still room for improvement. If I were to return to this project, the next step would be to turn this into a classification problem. If I were to choose a a cutoff for upvotes and say anything below that is a 0 (unpopular) and above is 1 (popular) I think that may be a more promising and usefull model for exposure as opposed to predicting the exact amount of upvotes. Reddit also has a “Rising” page for subreddits that identify newer posts getting lots of activity. I think it would be valuable to take the subset of posts that make it to the “Rising” page, and try to predict which of those become popular, as opposed to trying to predict from brand new posts. This would also be easier to predict accurately for the following reason: since I set my goal to be predicting from the creation of a post, I dropped the most predictive feature, the number of comments, so as not to look at the problem from the future. Looking at posts on the “Rising” page, they would already have some amount of comments and upvotes since their creation, which would be an incredibly useful metrics to look at for whether they would move on and be popular or fizz out.