How do we know which of the models we’ve trained is best for production? And how can we quickly estimate the cost of scaling that ML pipeline with production data? This section walks through comparing trained models, selecting one, and moving it into production.
Scroll down to the section titled “Precision vs. Recall” in the Customer Churn Notebook to follow along with the code for these next steps. Model selection is a broad topic with more comprehensive courses and methodologies than we can cover here; beyond the metrics below, we will also want to test our trained model against production data to see how it performs at scale (which we discuss in the next section).
In this section, we will cover how to take your trained models’ outputs and compare them against one another. We are comparing three models in this example – Random Forest, Logistic Regression, and Gradient-Boosted Trees.
We will be collecting some important metrics from our models: precision, recall, F-measure, the ROC curve, and the precision-recall (PR) curve.
We’ll start by importing the binary classification metrics classes and defining a helper that builds a BinaryClassificationMetrics object from each model’s output Dataset –
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Build BinaryClassificationMetrics from a model's output Dataset by extracting
// (score, label) pairs. Handles both a probability vector and a raw double score.
def initClassificationMetrics(dataset: Dataset[_]): BinaryClassificationMetrics = {
  val scoreAndLabels = dataset
    .select(col("probability"), col("label").cast(DoubleType))
    .rdd
    .map {
      case Row(prediction: org.apache.spark.ml.linalg.Vector, label: Double) => (prediction(1), label)
      case Row(prediction: Double, label: Double) => (prediction, label)
    }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  metrics
}
We then collect each model’s metrics into DataFrames. This walkthrough only shows the Random Forest example (visit this Notebook for the complete example) –
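A minimal sketch of that step, assuming the Random Forest predictions live in a DataFrame named randomForestPredictions (the variable and column names here are illustrative, not taken from the Notebook), is to build the threshold-based metric curves and tag each row with the model name so the models can be unioned and compared later –

%spark
import spark.implicits._
import org.apache.spark.sql.functions.lit

// Build the metric curves for the Random Forest model and tag them with the model name.
val metricsRandomForest = initClassificationMetrics(randomForestPredictions)

val precisionRandomForest = metricsRandomForest.precisionByThreshold.toDF("threshold", "precision").withColumn("model", lit("Random Forest"))
val recallRandomForest    = metricsRandomForest.recallByThreshold.toDF("threshold", "recall").withColumn("model", lit("Random Forest"))
val fMeasureRandomForest  = metricsRandomForest.fMeasureByThreshold.toDF("threshold", "fMeasure").withColumn("model", lit("Random Forest"))
val rocRandomForest       = metricsRandomForest.roc.toDF("FPR", "TPR").withColumn("model", lit("Random Forest"))
val prRandomForest        = metricsRandomForest.pr.toDF("recall", "precision").withColumn("model", lit("Random Forest"))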
From here, we need to create a temporary table for each of our metrics, which we will use to visualize the metrics from each of our models –
val precisionAll = precisionRandomForest.unionAll(precisionLogistic).unionAll(precisionGBT)
val recallAll = recallRandomForest.unionAll(recallLogistic).unionAll(recallGBT)
val fMeasureAll = fMeasureRandomForest.unionAll(fMeasureLogistic).unionAll(fMeasureGBT)
val rocAll = rocRandomForest.unionAll(rocLogistic).unionAll(rocGBT)
val prAll = prRandomForest.unionAll(prLogistic).unionAll(prGBT)

precisionAll.createOrReplaceTempView("PRECISION_ALL")
recallAll.createOrReplaceTempView("RECALL_ALL")
fMeasureAll.createOrReplaceTempView("FMEASURE_ALL")
rocAll.createOrReplaceTempView("ROC_ALL")
prAll.createOrReplaceTempView("PR_ALL")
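With the temporary views in place, any notebook paragraph can query them for charting. As a minimal sketch (column names follow the Random Forest example above and may differ in your Notebook), you can pull the combined ROC data back as a DataFrame and inspect it before plotting –

%spark
// Query the combined ROC temp view; the notebook's charting (or Plotly)
// can then plot FPR vs. TPR, grouped by model.
val rocCurves = spark.sql("SELECT * FROM ROC_ALL")
rocCurves.show(5)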
For the rest of this walkthrough, we will be using ROC curve visualization in Plotly to select our model. The example Notebook contains all the other ways mentioned to measure a model’s outputs.
When it comes to binary classification models, a useful evaluation metric is the area under the ROC curve. The ROC curve is built by sweeping the threshold a binary classifier uses to turn predicted continuous scores into labels, and plotting the true positive rate against the false positive rate at each threshold.
As you vary that threshold, the model moves between two extremes: when the true positive rate (TPR) and the false positive rate (FPR) are both 0, everything is labeled “not churned”; when both the TPR and FPR are 1, everything is labeled “churned.”
A random predictor that labels a customer as churned half the time and not churned the other half produces an ROC curve that is a straight diagonal line. This line cuts the unit square into two equally sized triangles, so the area under the curve is 0.5.
An AUROC value of 0.5 means your predictor is no better at discriminating between the two classes than random guessing (e.g., a coin toss). The closer the value is to 1.0, the better its predictions are. A value below 0.5 indicates that we could actually get better predictions by reversing the answers the model gives us.
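As a minimal sketch (the metric object names below, such as metricsLogistic and metricsGBT, are assumptions for illustration and should match whatever you produced with initClassificationMetrics), you can print each model’s AUROC side by side –

%spark
// Compare the area under the ROC curve across the three candidate models.
val aurocByModel = Seq(
  "Random Forest"          -> metricsRandomForest.areaUnderROC,
  "Logistic Regression"    -> metricsLogistic.areaUnderROC,
  "Gradient Boosted Trees" -> metricsGBT.areaUnderROC
)
aurocByModel.foreach { case (name, auroc) => println(f"$name%-24s AUROC = $auroc%.3f") }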
As you can see from the output below, the Random Forest model was our highest performer, though only by a margin of 0.01 –
Qubole provides various options for productionizing your Notebook code. You would normally want to schedule your code to run on a periodic basis.
Using PipelineModel’s write API, we can serialize the model to disk and/or object storage. We’ll start by exporting the model that we just trained into our cloud storage bucket –
%spark
randomForestModel.write.overwrite().save("/user/telco_churn_rf.model_v0")
Qubole also allows you to easily select another data store such as Redshift or Snowflake Data Warehouse to export our results to.
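As a minimal sketch using Spark’s generic JDBC writer (the DataFrame name resultsDF, the connection URL, the table name, and the credentials below are placeholders only, and Qubole’s Redshift and Snowflake connectors may expose their own options), exporting results could look like this –

%spark
// Sketch only: write a results DataFrame to an external warehouse over JDBC.
// resultsDF, the URL, the table, and the credentials are illustrative placeholders.
resultsDF.write
  .format("jdbc")
  .option("url", "jdbc:redshift://<host>:5439/<database>")
  .option("dbtable", "telco_churn_predictions")
  .option("user", "<user>")
  .option("password", "<password>")
  .mode("append")
  .save()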
Once the model is trained, it can be served for production purposes, such as ranking or classifying your data. A model can be served in batch or in real time; in this case we are using a batch process –
%spark
import org.apache.spark.ml.PipelineModel

val pipelineModel = PipelineModel.load("/user/telco_churn_rf.model_v0")
val pipelineModelPredictions = pipelineModel.transform(test)
val pipelineMetrics = initClassificationMetrics(pipelineModelPredictions)
val aurocPipeline = pipelineMetrics.areaUnderROC
val auprcPipeline = pipelineMetrics.areaUnderPR
It is important in this process that we record our model metrics, as we can use them down the line to see how our model is performing relative to previous deployments and our trained base model.
This is important so we can see if the model is performing to the standards that our business or operation requires, and determine if we need to go back and retrain the model or do further development.
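As a minimal sketch (the table name model_metrics_history and the version string are assumptions for illustration), one way to keep that history is to append each run’s AUROC and AUPRC, tagged with a model version and timestamp, to a small tracking table –

%spark
import spark.implicits._
import org.apache.spark.sql.functions.current_timestamp

// Append this deployment's metrics to a history table so future runs can be
// compared against previous deployments and the trained baseline.
val metricsRow = Seq(("telco_churn_rf.model_v0", aurocPipeline, auprcPipeline))
  .toDF("model_version", "auroc", "auprc")
  .withColumn("recorded_at", current_timestamp())

metricsRow.write.mode("append").saveAsTable("model_metrics_history")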
Naturally, it is not recommended that you productionize a Notebook that is being used for development. Notebooks can also bring added memory overhead that you should consider when productionizing.
If you do choose to use Notebooks for production, they can also be scheduled to run on Spark.
For creating pipelines, you can use the Qubole Scheduler for simple workflows (such as running a Notebook, then running a direct query against Spark afterwards); for more complex data operations, Qubole offers Airflow out of the box. Airflow is a lightweight technology that uses Python to build and manage complex DAG-based pipelines. The Qubole operator makes it very easy to use the Qubole API to run Spark Notebooks and other workloads.
Depending on your needs, there are various deployment options – porting code to an Analyze command and configuring it with --archives, scheduling with the Qubole Scheduler, or using Airflow. Check out the webinar Best Practices: How To Build Scalable Data Pipelines for Machine Learning to learn how to use Airflow for productionizing ML applications.
As a final thought, if you are packaging code for production, you should consider submitting the commands directly to Spark (rather than using Notebooks), as this allows for significant flexibility and reproducibility. With the spark-submit option --archives, you can deploy all dependencies for the job, including the Python binaries. This way the job does not depend as much on the cluster configuration or on the versions of the dependencies in the repositories.
This concludes the final section of our tutorial on “Getting Started with Apache Spark for Advanced Analytics”. For more Spark content, training, and community information, check out the Next Steps tab with tons of helpful resources!