How do we know which of the models we’ve trained is best for production? And how can we quickly estimate the cost of scaling that ML pipeline with production data? This section walks through comparing trained models, selecting one, and moving it into production.
Scroll down to the section titled “Precision vs. Recall” in the Customer Churn Notebook to follow along with the code for these next steps. Model selection is a broad topic with more comprehensive courses and methodologies than we can cover here; beyond the metrics below, we will also want to test our trained model against production data to see how it performs at scale (which we discuss in the next section).
In this section, we will cover how to take your trained models’ outputs and compare them against one another. We are comparing three models in this example – Random Forest, Logistic Regression, and Gradient-Boosted Trees.
We will be collecting some important metrics from our models: precision, recall, F-measure, the ROC curve, and the precision-recall (PR) curve.
We’ll start by importing the binary classification metrics classes and defining a helper that builds a BinaryClassificationMetrics object from each model’s output Dataset –
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Build BinaryClassificationMetrics from a model's output Dataset by extracting
// (score, label) pairs. Handles both a probability vector and a raw double score.
def initClassificationMetrics(dataset: Dataset[_]): BinaryClassificationMetrics = {
  val scoreAndLabels = dataset
    .select(col("probability"), col("label").cast(DoubleType))
    .rdd
    .map {
      case Row(prediction: org.apache.spark.ml.linalg.Vector, label: Double) => (prediction(1), label)
      case Row(prediction: Double, label: Double) => (prediction, label)
    }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  metrics
}
We then collect each model’s metrics into DataFrames. This walkthrough only shows the Random Forest example (visit this Notebook for the complete example) –
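A minimal sketch of that step, assuming the Random Forest predictions live in a DataFrame named randomForestPredictions (the variable and column names here are illustrative, not taken from the Notebook), is to build the threshold-based metric curves and tag each row with the model name so the models can be unioned and compared later –

%spark
import spark.implicits._
import org.apache.spark.sql.functions.lit

// Build the metric curves for the Random Forest model and tag them with the model name.
val metricsRandomForest = initClassificationMetrics(randomForestPredictions)

val precisionRandomForest = metricsRandomForest.precisionByThreshold.toDF("threshold", "precision").withColumn("model", lit("Random Forest"))
val recallRandomForest    = metricsRandomForest.recallByThreshold.toDF("threshold", "recall").withColumn("model", lit("Random Forest"))
val fMeasureRandomForest  = metricsRandomForest.fMeasureByThreshold.toDF("threshold", "fMeasure").withColumn("model", lit("Random Forest"))
val rocRandomForest       = metricsRandomForest.roc.toDF("FPR", "TPR").withColumn("model", lit("Random Forest"))
val prRandomForest        = metricsRandomForest.pr.toDF("recall", "precision").withColumn("model", lit("Random Forest"))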
From here, we need to create a temporary table for each of our metrics, which we will use to visualize the metrics from each of our models –
val precisionAll = precisionRandomForest.unionAll(precisionLogistic).unionAll(precisionGBT)
val recallAll = recallRandomForest.unionAll(recallLogistic).unionAll(recallGBT)
val fMeasureAll = fMeasureRandomForest.unionAll(fMeasureLogistic).unionAll(fMeasureGBT)
val rocAll = rocRandomForest.unionAll(rocLogistic).unionAll(rocGBT)
val prAll = prRandomForest.unionAll(prLogistic).unionAll(prGBT)

precisionAll.createOrReplaceTempView("PRECISION_ALL")
recallAll.createOrReplaceTempView("RECALL_ALL")
fMeasureAll.createOrReplaceTempView("FMEASURE_ALL")
rocAll.createOrReplaceTempView("ROC_ALL")
prAll.createOrReplaceTempView("PR_ALL")
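With the temporary views in place, any notebook paragraph can query them for charting. As a minimal sketch (column names follow the Random Forest example above and may differ in your Notebook), you can pull the combined ROC data back as a DataFrame and inspect it before plotting –

%spark
// Query the combined ROC temp view; the notebook's charting (or Plotly)
// can then plot FPR vs. TPR, grouped by model.
val rocCurves = spark.sql("SELECT * FROM ROC_ALL")
rocCurves.show(5)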
For the rest of this walkthrough, we will be using ROC curve visualization in Plotly to select our model. The example Notebook contains all the other ways mentioned to measure a model’s outputs.
When it comes to binary classification models, a useful evaluation metric is the area under the ROC curve. The ROC curve is built by sweeping the threshold a binary classifier uses to turn predicted continuous scores into labels, and plotting the true positive rate against the false positive rate at each threshold.
As you vary that threshold, the model moves between two extremes: when the true positive rate (TPR) and the false positive rate (FPR) are both 0, everything is labeled “not churned”; when both the TPR and FPR are 1, everything is labeled “churned.”
A random predictor that labels a customer as churned half the time and not churned the other half produces an ROC curve that is a straight diagonal line. This line cuts the unit square into two equally sized triangles, so the area under the curve is 0.5.
An AUROC value of 0.5 means your predictor is no better at discriminating between the two classes than random guessing (e.g., a coin toss). The closer the value is to 1.0, the better its predictions are. A value below 0.5 indicates that we could actually get better predictions by reversing the answers the model gives us.
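As a minimal sketch (the metric object names below, such as metricsLogistic and metricsGBT, are assumptions for illustration and should match whatever you produced with initClassificationMetrics), you can print each model’s AUROC side by side –

%spark
// Compare the area under the ROC curve across the three candidate models.
val aurocByModel = Seq(
  "Random Forest"          -> metricsRandomForest.areaUnderROC,
  "Logistic Regression"    -> metricsLogistic.areaUnderROC,
  "Gradient Boosted Trees" -> metricsGBT.areaUnderROC
)
aurocByModel.foreach { case (name, auroc) => println(f"$name%-24s AUROC = $auroc%.3f") }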
As you can see from the output below, the Random Forest model was our highest performer, though only by a margin of 0.01 –
Qubole provides various options for productionizing your Notebook code. You would normally want to schedule your code to run on a periodic basis.
Using PipelineModel’s write API, we can serialize the model to disk and/or object storage. We’ll start by exporting the model that we just trained into our cloud storage bucket –
%spark
randomForestModel.write.overwrite().save("/user/telco_churn_rf.model_v0")
Qubole also allows you to easily select another data store such as Redshift or Snowflake Data Warehouse to export our results to.
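As a minimal sketch using Spark’s generic JDBC writer (the DataFrame name resultsDF, the connection URL, the table name, and the credentials below are placeholders only, and Qubole’s Redshift and Snowflake connectors may expose their own options), exporting results could look like this –

%spark
// Sketch only: write a results DataFrame to an external warehouse over JDBC.
// resultsDF, the URL, the table, and the credentials are illustrative placeholders.
resultsDF.write
  .format("jdbc")
  .option("url", "jdbc:redshift://<host>:5439/<database>")
  .option("dbtable", "telco_churn_predictions")
  .option("user", "<user>")
  .option("password", "<password>")
  .mode("append")
  .save()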
Once the model is trained, it can be served for production purposes, such as ranking or classifying your data. A model can be served in batch or in real time; in this case we are using a batch process –
%spark
import org.apache.spark.ml.PipelineModel

val pipelineModel = PipelineModel.load("/user/telco_churn_rf.model_v0")
val pipelineModelPredictions = pipelineModel.transform(test)
val pipelineMetrics = initClassificationMetrics(pipelineModelPredictions)
val aurocPipeline = pipelineMetrics.areaUnderROC
val auprcPipeline = pipelineMetrics.areaUnderPR
It is important in this process that we record our model metrics, as we can use them down the line to see how our model is performing relative to previous deployments and our trained base model.
This is important so we can see if the model is performing to the standards that our business or operation requires, and determine if we need to go back and retrain the model or do further development.
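As a minimal sketch (the table name model_metrics_history and the version string are assumptions for illustration), one way to keep that history is to append each run’s AUROC and AUPRC, tagged with a model version and timestamp, to a small tracking table –

%spark
import spark.implicits._
import org.apache.spark.sql.functions.current_timestamp

// Append this deployment's metrics to a history table so future runs can be
// compared against previous deployments and the trained baseline.
val metricsRow = Seq(("telco_churn_rf.model_v0", aurocPipeline, auprcPipeline))
  .toDF("model_version", "auroc", "auprc")
  .withColumn("recorded_at", current_timestamp())

metricsRow.write.mode("append").saveAsTable("model_metrics_history")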
Naturally, it is not recommended that you productionize a Notebook that is being used for development. Notebooks can also bring added memory overhead that you should consider when productionizing.
If you do choose to use Notebooks for production, they can also be scheduled to run on Spark.
For creating pipelines, you can use the Qubole Scheduler for simple workflows (such as running a Notebook, then running a direct query against Spark afterwards); for more complex data operations, Qubole offers Airflow out of the box. Airflow is a lightweight technology that uses Python to build and manage complex DAG-based pipelines. The Qubole operator makes it very easy to use the Qubole API to run Spark Notebooks and other workloads.
Depending on your needs, there are various deployment options – porting code to an Analyze command and configuring it with --archives, scheduling with the Qubole Scheduler, or using Airflow. Check out the webinar Best Practices: How To Build Scalable Data Pipelines for Machine Learning to learn how to use Airflow for productionizing ML applications.
As a final thought, if you are packaging code for production, you should consider submitting the commands directly to Spark (rather than using Notebooks), as this allows for significant flexibility and reproducibility. With the spark-submit option --archives, you can deploy all dependencies for the job, including the Python binaries. This way the job does not depend as much on the cluster configuration or on the versions of the dependencies in the repositories.
This concludes the final section of our tutorial on “Getting Started with Apache Spark for Advanced Analytics”. For more Spark content, training, and community information, check out the Next Steps tab with tons of helpful resources!