Machine Learning Movie Recommendations Using AWS SageMaker

Goal

The goal of this project was to build a Netflix-style recommendation engine with AWS SageMaker and other ML tools.

Typically, in any Machine Learning project, the Data Preparation phase takes up about 70% of the project time, and this proved true for me in this project. About 90% of my time was spent learning ML, learning Jupyter, and understanding, cleaning, and preparing the IMDb-sourced data. The remaining 10% was spent on the LAMP (Linux, Apache, MySQL, PHP) setup that delivers the recommendations to the public.

Project Description

Using IMDb datasets, AWS SageMaker, a hosted Jupyter notebook, and Python data science libraries, and exploring Matplotlib, scikit-learn, and the k-means learning algorithm along the way, I created a Netflix-style recommendation engine.

Main Steps

Learning

At the beginning of this project I knew very little about Machine Learning, so I embarked on a steep learning curve. In this learning endeavor I used these resources:

Introduction to Machine Learning.PNG

Introduction to Jupyter Notebooks.PNG

Of course, what project can be completed without lots of help from various authors and websites, and, of course, stackoverflow.com?

With a foundation of technical knowledge, I began the technical work, which is broken into five categories.

1. Create Jupyter hosted notebook

To start the data inspection process, I launched a hosted Jupyter notebook on Amazon SageMaker. I used Python and data science libraries such as NumPy and pandas (in particular its DataFrame) to work with the IMDb data.
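As a rough illustration of this loading step, here is a minimal sketch that reads two IMDb-style TSV files into pandas and joins them. The column names follow IMDb's published title.basics and title.ratings files; the rows are made-up stand-ins, and the load_imdb helper is a hypothetical name, not code taken from the notebook.

```python
import io

import pandas as pd

# Tiny made-up stand-ins for title.basics.tsv and title.ratings.tsv.
# IMDb's TSV files mark missing values with the two characters \N.
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\tgenres\n"
    "tt0000001\tmovie\tExample Movie A\tDrama,Romance\n"
    "tt0000002\tmovie\tExample Movie B\t\\N\n"
    "tt0000003\tshort\tExample Short C\tComedy\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.8\t1200\n"
    "tt0000002\t6.1\t340\n"
)


def load_imdb(basics_file, ratings_file):
    """Read both files, keep feature films only, and attach their ratings."""
    basics = pd.read_csv(basics_file, sep="\t", na_values="\\N")
    ratings = pd.read_csv(ratings_file, sep="\t", na_values="\\N")
    movies = basics[basics["titleType"] == "movie"]
    # The inner join drops titles that have no rating row.
    return movies.merge(ratings, on="tconst", how="inner")


movies = load_imdb(io.StringIO(basics_tsv), io.StringIO(ratings_tsv))
print(movies)
```

With the real files, the two StringIO objects would be replaced by paths to the downloaded TSVs.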

2. Inspect and visualize data

It was important to gain domain knowledge of the IMDb data so that I could easily detect anomalies and outliers. I used Matplotlib and Seaborn for this step.

3. Prepare and transform data

The next step was to put the data in a format a machine can learn from. The IMDb data includes movies in all languages, with ratings that range from 0 to 10 and a vote count for each title. For this project I decided to use only English movies, and since I wanted to provide good recommendations, I kept only movies with at least 500 votes and a rating higher than 7. Since Machine Learning algorithms work on numerical data, I needed to remove or convert all textual information. I converted the genre information into a one-hot encoding (OHE) and then removed the genre column. I concatenated the movie title with the index and then removed the movie title column. You can see all of this data transformation and feature engineering in the Jupyter Notebook file in my GitHub repository.
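The filtering and encoding described above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the rows are made up, the column names are borrowed from IMDb's files, the language filter is left out, and the "index_title" label format is only one reading of how the title was combined with the index.

```python
import pandas as pd

# Made-up rows for illustration only.
movies = pd.DataFrame({
    "primaryTitle": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Drama,Romance", "Comedy", "Action,Drama", "Comedy"],
    "averageRating": [7.8, 6.1, 8.2, 7.5],
    "numVotes": [1200, 5000, 900, 120],
})

MIN_VOTES = 500    # threshold stated in the text
MIN_RATING = 7.0   # keep ratings strictly above this value

# Keep only well-rated movies with enough votes.
mask = (movies["numVotes"] >= MIN_VOTES) & (movies["averageRating"] > MIN_RATING)
kept = movies[mask].reset_index(drop=True)

# One-hot encode the comma-separated genre strings (one 0/1 column per genre).
genre_flags = kept["genres"].str.get_dummies(sep=",")

# Numeric-only feature table: the text columns are left behind.
features = pd.concat([kept[["averageRating", "numVotes"]], genre_flags], axis=1)

# Keep a lookup so a feature row can be traced back to its movie.
labels = [f"{i}_{title}" for i, title in kept["primaryTitle"].items()]

print(features)
print(labels)
```

Keeping the label list separate means the clustering input stays purely numeric while each row can still be mapped back to a title.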

4. Train

Once the data was prepared, training began with the selected machine learning algorithm, which clusters (groups) the IMDb data in order to make recommendations. The plan was to use the k-means clustering algorithm. I considered two providers: Amazon SageMaker provides a k-means clustering algorithm, and so does scikit-learn. I was successful with Amazon SageMaker, so I did not explore scikit-learn's version. I did encounter one SageMaker issue that took me several days to resolve. When I first called the k-means predictor's predict method on the training data, I got a "413 Request Entity Too Large" error. It turns out that calling "kmeans_predictor.predict" invokes the SageMaker endpoint to process the prediction, and the AWS documentation states: "Maximum payload size for endpoint invocation is 6 MB." Therefore, I split the training data into two parts, invoked the endpoint twice to get two sets of predictions, and then combined the two results into one to match the original format.
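The split-and-combine workaround can be generalized to any number of chunks. The sketch below is a hedged illustration: predict_in_chunks is a hypothetical helper, and nearest_centroid is a local stand-in for the real endpoint call so that the example runs without AWS. How many rows fit under the payload limit depends on how the request is serialized, so rows_per_chunk is left for the caller to choose.

```python
import numpy as np


def predict_in_chunks(predict_fn, data, rows_per_chunk):
    """Call predict_fn on consecutive row slices and join the results in order."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    parts = [
        predict_fn(data[start:start + rows_per_chunk])
        for start in range(0, len(data), rows_per_chunk)
    ]
    return np.concatenate(parts) if parts else np.array([], dtype=int)


# Stand-in for the endpoint: assign each row to its nearest fixed centroid.
CENTROIDS = np.array([[0.0, 0.0], [10.0, 10.0]])
call_sizes = []  # records how many rows each "request" carried


def nearest_centroid(chunk):
    call_sizes.append(len(chunk))
    distances = np.linalg.norm(chunk[:, None, :] - CENTROIDS[None, :, :], axis=2)
    return distances.argmin(axis=1)


data = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, -0.1], [10.2, 9.9], [0.0, 0.4]])
cluster_ids = predict_in_chunks(nearest_centroid, data, rows_per_chunk=2)
print(cluster_ids, call_sizes)
```

Because the chunks are processed in order and concatenated, the combined result lines up row for row with the original data, which is what makes it safe to attach the cluster ids back onto the DataFrame.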

5. Recommend

With clusters identified and labelled in the pandas DataFrame (see the Jupyter Notebook), I was able to export the data as a CSV file and load it into a MySQL database. With some simple PHP and SQL queries, I created a basic webpage that provides movie recommendations. This movie recommendation webpage can be found at Brian's Movie Recommendation Site. I will leave this site up for a few days.
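To illustrate the kind of query that can turn cluster labels into recommendations, here is a small sketch. It is an assumption-laden stand-in, not the site's code: it uses Python's built-in sqlite3 in place of MySQL and PHP, and the movies table, its columns, and the recommend helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL, cluster INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        # Made-up rows for illustration only.
        ("Movie A", 7.8, 0),
        ("Movie B", 8.4, 1),
        ("Movie C", 8.2, 0),
        ("Movie D", 7.5, 0),
    ],
)


def recommend(conn, title, limit=5):
    """Return other movies from the same cluster, best rated first."""
    rows = conn.execute(
        """
        SELECT other.title
        FROM movies AS picked
        JOIN movies AS other
          ON other.cluster = picked.cluster
         AND other.title <> picked.title
        WHERE picked.title = ?
        ORDER BY other.rating DESC
        LIMIT ?
        """,
        (title, limit),
    ).fetchall()
    return [row[0] for row in rows]


print(recommend(conn, "Movie A"))
```

The self-join keeps the lookup to a single query: the picked row supplies the cluster id, and every other row in that cluster is a candidate.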

Implemented Architecture

The diagram below depicts the architecture I implemented. Just to be clear, all the Machine Learning elements (Jupyter Notebook, SageMaker, etc.) to the left of the VPC took up about 90% of the work. The VPC, web server, and database server were created by a CloudFormation template.

ml-movie-recommendations.png

Conclusion

This was a very challenging project, but I learned a lot. I am far from being a Machine Learning expert, but neither am I a novice any longer. My interest has certainly been kindled in a subject in which I formerly had very little interest. What I produced works, but it certainly could be refined. I am not completely satisfied with the recommendations it produces and would like to improve them.