How Qubole Maximizes Spot Utilization and Reduces Costs
One of our customers, a large enterprise cloud content management company, runs several sophisticated Machine Learning (ML) predictive models daily for its retail clients. These models consume large volumes of data, such as products, line items, and purchase orders. A necessary prerequisite is preparing this data for training and running the ML models, so the customer runs a daily Extract, Transform, and Load (ETL) job using Apache Spark on Qubole. However, the job would run for over forty-six minutes and fail intermittently, causing reliability issues, delays, and troubleshooting overhead, in addition to cost overruns and business ramifications for its clients. It was essential that we help this customer resolve these issues and ensure the job ran reliably with the best performance possible.
In this three-part blog series, I will explain how we helped this customer optimize their Spark clusters for improved performance and cost while ensuring reliability, using innovative tools in Qubole’s Open Data Lake Platform, such as Qubole Cost Explorer and Sparklens.
- The first blog explains how you can leverage heterogeneous configurations in Qubole clusters to maximize the chance of getting Spot nodes.
- The second blog explains how a Spark cluster on Qubole can be configured to achieve higher Spot utilization, which optimizes the job for lower costs while ensuring reliability.
- The third blog covers how a Spark job can be fine-tuned for better performance using Qubole Sparklens, a tool we open-sourced for visualizing and optimizing Spark jobs.
Leveraging Heterogeneous Configurations in Qubole Clusters
Consider the following example, where a cluster is configured as follows:
- Minimum Worker Nodes = 2
- Maximum Worker Nodes = 250
- Master Instance Type = r5.4xlarge (16 cores, 128 GB mem)
- Worker Instance Type = r5.2xlarge (8 cores, 64 GB mem)
- Heterogeneous Configuration = Disabled
- Fallback to On-Demand
- Option #1: Enabled
- Option #2: Disabled
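For reference, here is the same configuration expressed as a plain Python dictionary. The keys are descriptive labels for the settings above, not the exact Qubole API schema.

```python
# Illustrative representation of the example cluster settings.
# Key names are descriptive, not Qubole's actual API fields.
cluster_config = {
    "min_worker_nodes": 2,
    "max_worker_nodes": 250,
    "master_instance_type": "r5.4xlarge",  # 16 cores, 128 GB memory
    "worker_instance_type": "r5.2xlarge",  # 8 cores, 64 GB memory
    "heterogeneous_config": None,          # disabled in this example
    "fallback_to_ondemand": True,          # Option #1; set False for Option #2
}
```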
Qubole’s automated Cluster Lifecycle Management capabilities help reduce cloud costs significantly by ensuring a cluster is running only when there are active workloads. To start with, clusters are down, avoiding unnecessary compute costs.
Upon the arrival of the first workload, or when a scheduled interval condition is met, the Qubole cluster warms up to its minimum configuration. In this example, the minimum configuration comprises one master node of type r5.4xlarge and two worker nodes of type r5.2xlarge.
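Conceptually, the lifecycle behavior boils down to sizing the cluster from demand, clamped between the minimum and maximum worker counts. The following is a minimal sketch of that idea, not Qubole’s actual implementation:

```python
def nodes_needed(pending_workloads, min_workers=2, max_workers=250):
    """Return a worker count for current demand, clamped to the configured range."""
    if not pending_workloads:
        return 0  # no active workloads: the cluster stays down, costing nothing
    demand = sum(w["workers"] for w in pending_workloads)
    return max(min_workers, min(demand, max_workers))

print(nodes_needed([]))                  # 0   -> cluster stays down
print(nodes_needed([{"workers": 1}]))    # 2   -> warms up to the minimum
print(nodes_needed([{"workers": 100}]))  # 100 -> upscale event
```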
It commonly happens that the worker instance type configured in a cluster is relatively new or has become very popular, and is therefore either not available in the Spot market or running low on availability. Let’s assume that the worker instance type configured in this example, r5.2xlarge, is one such recently introduced instance type.
Now, consider the following scenario:
- The cluster is running at the minimum configuration of one master node of type r5.4xlarge and two minimum worker nodes of type r5.2xlarge.
- Let’s say that the hourly On-Demand price for r5.2xlarge is $10 and the Spot discount is, hypothetically, 90%, making the hourly Spot price $1.
- A giant workload arrives, which triggers an upscale event and requires an additional capacity of one hundred r5.2xlarge worker nodes. This equates to 800 cores and 6,400 GB of memory, adding $100 to the hourly cost at Spot prices.
- However, since r5.2xlarge is a popular instance type that is not available in the AWS Spot market, we have two options (the cost gap between them is sketched after this list):
- Option #1: If fallback to On-Demand is enabled in the cluster, one hundred On-Demand nodes will be acquired, which ensures that the cluster has sufficient capacity to meet the SLA of the workload and that the job does not fail. This, however, shoots the hourly EC2 cost up to $1,000.
- Option #2: If fallback to On-Demand is NOT enabled in the cluster, the cluster will keep retrying to acquire the desired Spot nodes. This inevitably introduces delays and failures, resulting in unpredictable costs and reliability problems.
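To make the difference concrete, here is the scenario’s arithmetic as a short Python sketch. The prices are the hypothetical ones above, not real EC2 rates.

```python
# Hypothetical pricing from the scenario above.
ON_DEMAND_PRICE = 10.0  # $/hour for one r5.2xlarge On-Demand node
SPOT_DISCOUNT = 0.90    # 90% discount off On-Demand
SPOT_PRICE = ON_DEMAND_PRICE * (1 - SPOT_DISCOUNT)  # $1/hour
EXTRA_WORKERS = 100     # nodes requested by the upscale event

spot_cost = EXTRA_WORKERS * SPOT_PRICE           # $100/hour if Spot is available
fallback_cost = EXTRA_WORKERS * ON_DEMAND_PRICE  # $1,000/hour under Option #1

print(f"Spot upscale:       ${spot_cost:,.0f}/hour")
print(f"On-Demand fallback: ${fallback_cost:,.0f}/hour "
      f"({fallback_cost / spot_cost:.0f}x the Spot cost)")
```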
Qubole’s Heterogeneous Cluster Configuration capability is designed to address these potential problems automatically. With heterogeneous clusters, you can configure the cluster to have additional worker instance types.
Now, let’s say we add heterogeneous configuration to the same cluster as follows:
- Primary Worker Instance Type – r5.2xlarge
- Secondary Worker Instance Types – r5.xlarge and r5.4xlarge
- An r5.xlarge instance has 4 cores and 32 GB of memory, with a hypothetical hourly Spot cost of $0.50.
- An r5.2xlarge instance has 8 cores and 64 GB of memory, with a Spot cost of $1.
- An r5.4xlarge instance has 16 cores and 128 GB of memory, with a Spot cost of $2.
Now, consider the same scenario that we discussed earlier:
- The same giant workload arrives, triggering an upscale event. This requires an additional capacity of one hundred r5.2xlarge worker nodes, which equates to 800 cores and 6,400 GB of memory.
- With heterogeneous configuration enabled in the new cluster, Qubole will try to meet this need with one of the following alternatives:
- 200 r5.xlarge instances; or
- 100 r5.2xlarge instances; or
- 50 r5.4xlarge instances; or
- A combination of r5.xlarge, r5.2xlarge, and r5.4xlarge instances that equates to the same desired additional capacity of one hundred r5.2xlarge instances.
- If the primary worker instance type, r5.2xlarge, is not available in the Spot market, the Spot request will be fulfilled with the secondary instance types, r5.xlarge and r5.4xlarge. The more alternate worker instance types you configure, the higher the chances of acquiring Spot nodes.
- Additionally, all of the above options add the same capacity (in this example, 800 cores and 6,400 GB of memory) and cost the same (in this example, $100 per hour). Each option amounts to the equivalent of adding 100 nodes of the primary instance type (r5.2xlarge), as the quick check below verifies.
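Using the hypothetical Spot prices above, a few lines of Python confirm that every alternative delivers identical capacity and cost. The combination shown is just one of many valid mixes.

```python
# Per-instance specs and hypothetical hourly Spot prices from this example.
SPECS = {
    "r5.xlarge":  {"cores": 4,  "mem_gb": 32,  "spot_price": 0.50},
    "r5.2xlarge": {"cores": 8,  "mem_gb": 64,  "spot_price": 1.00},
    "r5.4xlarge": {"cores": 16, "mem_gb": 128, "spot_price": 2.00},
}

def totals(mix):
    """Sum cores, memory, and hourly Spot cost for a {instance_type: count} mix."""
    cores = sum(SPECS[t]["cores"] * n for t, n in mix.items())
    mem = sum(SPECS[t]["mem_gb"] * n for t, n in mix.items())
    cost = sum(SPECS[t]["spot_price"] * n for t, n in mix.items())
    return cores, mem, cost

alternatives = [
    {"r5.xlarge": 200},
    {"r5.2xlarge": 100},
    {"r5.4xlarge": 50},
    {"r5.xlarge": 100, "r5.2xlarge": 20, "r5.4xlarge": 15},  # one possible mix
]
for mix in alternatives:
    print(mix, "->", totals(mix))  # every line prints (800, 6400, 100.0)
```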
In conclusion, I have demonstrated how Qubole’s native Heterogeneous Cluster Configuration automatically maximizes the chances of getting Spot nodes, improving performance and reducing costs for our customers while ensuring the system meets workload SLAs.
In the next blog, I will cover how we optimized the customer’s Spark clusters for cost and reliability.
Sign up for a free 14-day trial to experience the power of Spark on Qubole.