Projects

Predicting the Stock Market

Summary: Integrated financial information from the Alpha Vantage API into a neural network built using TensorFlow’s Python library to predict the S&P 500 closing stock price on the following day.

My data is mainly sourced via an API provided by the website Alpha Vantage. I explored using Google Trends data as well as Twitter data to complement it, but decided that neither was a good fit for the scope of this project.

Because the data is purely numeric, there was little need for complex transformations or feature engineering; rather, the difficulties were in dealing with its time-series nature.
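One common way to respect the time-series nature is to build each example from a window of past days and to split train and test chronologically instead of at random. The sketch below illustrates that idea only; the function names, the window size, and the sample prices are assumptions for illustration, not the code used in this project.

```python
def make_windows(prices, window=3):
    """Pair each run of `window` consecutive prices with the price that follows it."""
    features, targets = [], []
    for i in range(len(prices) - window):
        features.append(prices[i:i + window])
        targets.append(prices[i + window])
    return features, targets


def chronological_split(features, targets, train_fraction=0.8):
    """Split without shuffling, so the test set is strictly later than the training set."""
    cut = int(len(features) * train_fraction)
    return (features[:cut], targets[:cut]), (features[cut:], targets[cut:])


# Made-up closing prices, oldest first (illustrative values only).
closes = [100.0, 101.5, 101.0, 102.3, 103.1, 102.8, 104.0, 105.2]
X, y = make_windows(closes, window=3)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

Keeping the split in date order means the model is never evaluated on days that come before the ones it trained on.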

Working with stock data, I found that the most important thing to decide was whether the price of a stock will go up or down. To this end I turned the focus of my project from a regression, where I attempted to predict the closing price, to a classification, where I try to predict whether the price will change positively or negatively.
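Reframing the problem this way mostly comes down to how the target is built. As a minimal sketch, assuming a plain list of daily closing prices (the sample values are made up, and treating a flat day as "not up" is my assumption here):

```python
def direction_labels(closes):
    """Return 1 when the next close is higher than the current one, else 0."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]


# Made-up closing prices, oldest first (illustrative values only).
closes = [100.0, 101.5, 101.0, 102.3, 102.3]
labels = direction_labels(closes)
```

Each label lines up with the day it is predicted from, so the list is one element shorter than the price series.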

I used TensorFlow because its granularity of control allows me to generate regression values as well as classification predictions while optimizing on the classification part of the problem, whereas an sklearn regression would optimize for regression metrics only.
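To illustrate what optimizing mainly on the classification part can look like, here is a minimal pure-Python sketch of a joint loss for a single example. It is not the TensorFlow code from this project: the function name, the weights, and the choice of cross-entropy plus squared error are all assumptions made for illustration.

```python
import math


def joint_loss(pred_change, pred_up_prob, true_change,
               class_weight=1.0, reg_weight=0.1):
    """Weighted sum of a classification loss and a regression loss for one example."""
    true_up = 1.0 if true_change > 0 else 0.0
    # Clip the probability so log() never sees exactly 0 or 1.
    p = min(max(pred_up_prob, 1e-7), 1 - 1e-7)
    cross_entropy = -(true_up * math.log(p) + (1 - true_up) * math.log(1 - p))
    squared_error = (pred_change - true_change) ** 2
    return class_weight * cross_entropy + reg_weight * squared_error


# Same predicted change, but one prediction is confident in the right direction
# and the other is confident in the wrong one.
right_direction = joint_loss(pred_change=0.5, pred_up_prob=0.9, true_change=0.5)
wrong_direction = joint_loss(pred_change=0.5, pred_up_prob=0.1, true_change=0.5)
```

With a small `reg_weight`, getting the direction wrong costs far more than being slightly off on the size of the change, which is the behavior a pure regression objective does not give you.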

To deploy this model, I envision it being set up on a server that is updated daily. Every day it can be used to generate the predicted closing price for that day (or the next, depending on the time). At the end of the day, once the true value is known, the neural net can update itself based on the newly known data. Additionally, I imagine that a model of this type should eventually be built to work as a piece of a trading algorithm, with the model providing a belief about what will happen in the future.

As for modeling techniques, I think it would be valuable to see how a Bayesian approach would work for this kind of time-series data. Additionally, I would like to expand upon the idea of using a neural net to estimate both the regression (change in price) and classification (did the price increase) parts of the problem, as right now I only have a very basic version of that duality working.

You can also see the GitHub repository for this project here!

This Website

Summary: This very website is a project in and of itself. My goal in creating it is to host a blog for any topics I want to discuss, to host a portfolio of sorts, and to learn in the process.

This website is built on top of the Jekyll-Now project and hosted via GitHub Pages. Starting from a template, I deconstructed it to learn how Jekyll works (it turns out to be fairly straightforward and uses Ruby), and rebuilt the site to fit my purposes using my knowledge of HTML, CSS, and Bootstrap.

The downside of GitHub Pages is that it doesn’t allow server-side code. Therefore the next step in developing the website is to move the hosting to a cloud service such as AWS, which would let me expand the uses of the website (such as including live demos of models). The process of moving to the cloud will be a great chance for me to gain more exposure to all the steps involved in web hosting, as well as provide a test bed for future web projects.

You can also see the GitHub repository for this project here!

Predicting Reddit Post Popularity

Summary: Performed natural language processing using the scikit-learn Python library to analyze a CSV containing Reddit posts and predict how many upvotes a post would get.

Provided with a collection of posts from numerous programming-focused subreddits on reddit.com, I was tasked with coming up with a target to predict and building models to do so.

After EDA and cleaning the data, I decided that my goal would be a regression model predicting the number of upvotes a post would get. I chose this because being able to predict whether a post will get lots of upvotes (which is directly related to how visible it becomes) can be very valuable. For example, some companies have large portions of their community based in a subreddit; predicting what becomes popular could let a Community Manager prioritize where they interact to get ahead of things, or aid in advertising efforts. Although none of the columns looked very predictive of upvotes, I didn’t want that to stop me from trying to predict something valuable, and I thought it would be interesting to see whether mainly natural language processing could predict how popular a post will get.

To that end I assumed I was looking at a post at the beginning of its lifetime, and therefore removed information from the dataset that you would only have later in the post’s lifetime, such as the number of comments and whether it was gilded. The main challenge was that there were a few huge outliers for score, while the vast majority of posts had only a couple of upvotes.
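As a rough sketch of those two steps, the snippet below drops fields that are only known later in a post's life and compresses the skewed score with a log transform. The column names and the sample post are hypothetical, and the log transform is one common option for heavy-tailed targets that I am assuming here, not necessarily what the project used. It also assumes scores are non-negative.

```python
import math

# Fields only known later in a post's life (hypothetical column names).
LEAKY_COLUMNS = {"num_comments", "gilded"}


def prepare_post(post):
    """Drop leaky fields and return (features, log-scaled score)."""
    features = {
        key: value
        for key, value in post.items()
        if key not in LEAKY_COLUMNS and key != "score"
    }
    # log(1 + score) pulls huge outliers toward the bulk of small scores.
    target = math.log1p(post["score"])
    return features, target


# Made-up example post (illustrative values only).
post = {
    "title": "Why is my loop slow?",
    "subreddit": "learnpython",
    "num_comments": 12,
    "gilded": 0,
    "score": 3,
}
features, target = prepare_post(post)
```

Predictions made on the log scale can be mapped back to upvotes with `math.expm1`.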

At the conclusion of the project I had a working model that did better than baseline, but there’s definitely still room for improvement. If I were to return to this project, the next step would be to turn this into a classification problem. If I chose a cutoff for upvotes and labeled anything below it 0 (unpopular) and anything above it 1 (popular), I think that could be a more promising and useful model for exposure than predicting the exact number of upvotes.

Reddit also has a “Rising” page for subreddits that identifies newer posts getting lots of activity. I think it would be valuable to take the subset of posts that make it to the “Rising” page and try to predict which of those become popular, as opposed to trying to predict from brand new posts. This would also be easier to predict accurately: since I set my goal to be predicting from the creation of a post, I dropped the most predictive feature, the number of comments, so as not to look at the problem from the future. Posts on the “Rising” page would already have some comments and upvotes since their creation, which would be incredibly useful metrics for judging whether they would go on to be popular or fizzle out.
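The cutoff idea can be sketched in a few lines. The cutoff of 100 and the sample scores are placeholders I made up, and counting a score exactly at the cutoff as popular is my assumption:

```python
def popularity_labels(scores, cutoff=100):
    """Return 1 (popular) for scores at or above the cutoff, 0 (unpopular) below it."""
    return [1 if score >= cutoff else 0 for score in scores]


# Made-up upvote counts (illustrative values only).
scores = [1, 2, 150, 100, 99]
labels = popularity_labels(scores)
```

Because most posts have only a couple of upvotes, the choice of cutoff also determines how imbalanced the two classes are.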

You can also see the GitHub repository for this project here!

Ames Housing Data

Summary: Used feature engineering with the Ames Housing Dataset to predict the expected value of houses in Ames, Iowa.

The Ames Housing Dataset is an exceptionally detailed and robust dataset with over 70 columns of different features relating to houses. I was tasked with creating two models to predict different targets: a regression model to predict the sale price of a house, and a classification model to predict whether a house sale was abnormal.

During my data analysis I determined that there were three types of data: nominal columns representing categorical information, ordinal columns representing rankings, and purely numeric columns. I used those groupings to transform each kind of data into its most useful form as efficiently as possible.
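A minimal sketch of treating the three groups differently might look like the following. The column names, the quality ranking, and the sample house are hypothetical stand-ins, not the actual transformations from this project:

```python
# Assumed ranking for an ordinal quality column, worst to best (hypothetical).
QUALITY_ORDER = ["Po", "Fa", "TA", "Gd", "Ex"]


def encode_ordinal(value, order=QUALITY_ORDER):
    """Map a ranked category to its position in the ranking."""
    return order.index(value)


def encode_nominal(value, categories):
    """One-hot encode an unordered category."""
    return [1 if value == category else 0 for category in categories]


def encode_house(house, neighborhoods):
    """Concatenate the encodings of one nominal, one ordinal, and one numeric column."""
    return (
        encode_nominal(house["neighborhood"], neighborhoods)
        + [encode_ordinal(house["kitchen_quality"])]
        + [float(house["living_area"])]
    )


# Made-up example row (illustrative values only).
neighborhoods = ["CollgCr", "NAmes", "OldTown"]
house = {"neighborhood": "NAmes", "kitchen_quality": "Gd", "living_area": 1500}
encoded = encode_house(house, neighborhoods)
```

Ordinal columns keep their order as a single number, while nominal columns are spread across indicator values so that no artificial ranking is implied.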

With my data cleaned, I approached each problem by trying out many models, using scikit-learn’s grid search (GridSearchCV) to find the best hyperparameters for each one. The models that worked best were a gradient boosting model for predicting house price and a random forest for determining whether the sale was abnormal.
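The core idea behind a grid search is simply to score every combination of candidate hyperparameters and keep the best one. The sketch below shows that loop in plain Python; it is not scikit-learn's implementation, which also cross-validates each combination. The scoring function is a made-up stand-in for "fit a model and return its validation score", and the parameter names and values are assumptions.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Try every combination in param_grid and return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(learning_rate, n_estimators):
    """Made-up stand-in for fitting a model and returning its validation score."""
    return -((learning_rate - 0.1) ** 2) - ((n_estimators - 200) / 1000) ** 2


best_params, best_score = grid_search(
    toy_validation_score,
    {"learning_rate": [0.01, 0.1, 0.5], "n_estimators": [100, 200, 400]},
)
```

The number of fits grows multiplicatively with each added hyperparameter, so it pays to keep the grids small.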

You can also see the GitHub repository for this project here!