Sparkify Customer Churn Prediction — DSND Capstone Project

Project Definition

The project is based on Udacity’s Data Scientist Nanodegree program.

Udacity’s Sparkify is a virtual company, similar to music streaming services such as Spotify or Google Music. Udacity has provided a dataset containing a customer behavior log from October to November 2018. The log holds time-stamped information (Unix time, in seconds since 1970) on every action a customer has taken, e.g. registration day, session length, and pages visited (the main source of information about customer behavior).
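To make the time-based fields easier to read, here is a minimal sketch of turning a Unix timestamp into a calendar date and measuring how long a customer has been registered. It uses only the Python standard library. The function names and the `unit` parameter are my own assumptions for illustration, not part of the dataset or of any Sparkify code; the `"ms"` option is there only in case the raw values turn out to be stored in milliseconds rather than seconds.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def to_utc(ts, unit="s"):
    """Convert a Unix timestamp to a timezone-aware UTC datetime.

    `unit` is a hypothetical switch: "s" for seconds, "ms" for milliseconds.
    """
    if unit == "ms":
        ts = ts / 1000
    elif unit != "s":
        raise ValueError("unit must be 's' or 'ms'")
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def days_since_registration(registration_ts, event_ts, unit="s"):
    """Return the number of days between registration and a later event."""
    delta = to_utc(event_ts, unit) - to_utc(registration_ts, unit)
    return delta.total_seconds() / SECONDS_PER_DAY


# Illustrative values only: midnight UTC on 1 October 2018, and three days later.
registered = 1538352000
event = registered + 3 * SECONDS_PER_DAY

print(to_utc(registered).date())
print(days_since_registration(registered, event))
```

Working in UTC keeps the result independent of the time zone of the machine the code runs on, which matters once the same logic moves from a laptop to a cluster.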

The dataset is 12 GB, so we need Amazon Web Services (AWS) Elastic MapReduce (EMR) and Spark to process it. The dataset…

On the 12 GB dataset I ran ETL, EDA, feature engineering, analysis, and the training and tuning of three machine learning models, and I went through that whole process twice, for the costs shown above. Most of the spending went to configuring the cluster and installing Python libraries.

Have you ever been stuck trying to diagnose a problem while paid virtual clusters keep running? In this blog you’ll find some tips for avoiding those costs.

My main reason for writing this blog was to show the results I got from the user churn assignment in Udacity’s Data Scientist Nanodegree program. During the process, however, I ran into practical issues, so I’d also like to share some tips for reducing costs when using virtual clusters.

The assignment was to build a model to predict customer churn for Udacity’s virtual company Sparkify, which is like the…

