If big data frameworks had a popularity contest, Apache Spark would be the attractive, trendy option everyone wants to be seen with. First developed at the AMPLab at UC Berkeley, Spark has become a widely adopted framework among organizations with lofty visions for Machine Learning (ML), sophisticated processing, and advanced analytics, among other big data projects.
Spark’s vibrant community of more than 1,700 contributors prioritizes agility, flexibility, and scalability, merging 300 to 400 code commits per month. And data professionals confirm the framework’s rising popularity: Spark is the most widely used big data framework and continues to grow, with usage up 29 percent since 2017.
Why Spark?
Created to address the speed limitations of traditional distributed processing engines such as Hadoop MapReduce, Spark’s in-memory data engine offers a faster, more fault-tolerant way to run workloads. With Spark, you can process large amounts of data concurrently across a cluster of servers without sacrificing speed or losing jobs. Because intermediate data can stay in memory rather than being written to disk between steps, organizations can rapidly handle large workloads for ML, streaming data, ETL (Extract, Transform, and Load), and batch and graph processing.
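To make that concrete, here is a minimal PySpark sketch of a batch job that reads a dataset once, caches it in executor memory, and reuses it for two aggregations. The bucket paths and column names are hypothetical, chosen only to illustrate the pattern:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a SparkSession; on a real cluster the master and
# resource settings come from the cluster manager.
spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input path and columns, for illustration only.
events = spark.read.json("s3://my-bucket/events/2018/*.json")

# cache() keeps the dataset in executor memory after the first action,
# so the second aggregation reuses it instead of re-reading from
# storage -- the in-memory reuse behind Spark's speed advantage.
events.cache()

daily = events.groupBy("event_date").agg(F.count("*").alias("events"))
by_user = events.groupBy("user_id").agg(F.count("*").alias("events"))

daily.write.mode("overwrite").parquet("s3://my-bucket/reports/daily/")
by_user.write.mode("overwrite").parquet("s3://my-bucket/reports/by_user/")
```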
Data professionals also like Spark because the framework works for multiple use cases and multiple user groups. What’s more, data scientists, data engineers, and data analysts can leverage Spark in conjunction with their preferred coding language — whether R, SQL, Scala, Python, or Java — or alongside their BI or analytics tool of choice.
If Spark Is the Holy Grail of Big Data, Why Don’t All Organizations Use It?
As promising as Spark sounds for businesses looking to ramp up their big data initiatives, the framework can also pose significant challenges. Chief among these is the strain Spark can place on your infrastructure and resources: data teams often spend significant time on routine maintenance requests and servicing internal customers. Poor resource utilization can impede or delay your success, because Spark, like any distributed data processing engine, is only as useful as your ability to optimize related resources and costs.
The key benefit of Spark is leveraging distributed computing to run big data projects faster and more efficiently. However, as an open source solution, Spark hasn’t always been the most intuitive tool to use. Spark users face several performance and usability hiccups, which limit the value Spark can deliver to your organization. And although Spark can serve a variety of big data workloads and user groups, reaching that state of enterprise-wide adoption typically requires significant work from the data team.
Maximize Cost Efficiency
Cost can be a critical obstacle to Spark adoption, both in infrastructure setup and in ongoing usage. Qubole reins in out-of-control spending with numerous platform features designed to help you control your cloud and Spark costs. Our customers have saved as much as 50 percent on their cloud costs, in part due to:
- Intelligent Spot management (AWS only): Qubole automatically bids on Spot instances and rebalances Spot nodes to maximize your low-cost instances without causing cascading failures.
- Workload-aware autoscaling: Our SLA-aware scaling algorithm determines the exact number of executors a workload needs, optimizing resource utilization.
- Aggressive downscaling: Qubole gracefully decommissions idle nodes or those that are no longer necessary, while container packing enables downscaling with higher resource utilization.
- Heterogeneous clusters: Qubole mixes different instance types in the same cluster, leading to significant cost savings and more reliable clusters.
Increase Processing Performance
Excelling with Spark boils down in part to performance: how quickly can you read, write, and process data? As a standalone open source solution, Spark is designed as a general-purpose engine that users configure and optimize for their specific use cases. However, customizing Spark can be a daunting, time-consuming task (the configuration sketch after the list below shows the kind of hand-tuning involved). Luckily, Qubole includes performance optimizations that ensure stability and business continuity, alongside built-in tools designed to increase processing efficiency:
- Direct writes: Qubole delivers faster write throughput when writing to S3, alleviating the need to stage writes and then commit them.
- Join optimizations: Qubole’s engine optimizations improve Spark performance for join operations on large data sets.
- RubiX: RubiX is Qubole’s platform-wide caching layer that uses local disks to improve the latency of read operations.
- Sparklens: Qubole’s open source profiling framework adds a layer of performance insight to your Spark applications. Sparklens identifies the critical path of a job and recommends configuration changes, such as how to trade off SLA (Service Level Agreement) targets against executor count.
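For contrast, here is a sketch of the kind of manual tuning a standalone Spark deployment requires, along with one way to attach Sparklens as a listener. Every setting shown is workload-dependent and purely illustrative, and the Sparklens package coordinate reflects our reading of its README and may differ for your Spark and Scala versions:

```python
from pyspark.sql import SparkSession

# Hand-tuning sketch: all values below are workload-dependent guesses,
# exactly the trial-and-error knob turning that managed
# optimizations aim to eliminate.
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.instances", "8")       # fixed executor count
    .config("spark.executor.memory", "4g")         # per-executor heap
    .config("spark.executor.cores", "2")           # tasks per executor
    .config("spark.sql.shuffle.partitions", "64")  # shuffle parallelism
    # Attach Sparklens as an extra listener to profile the job; the
    # package coordinate is an assumption based on the Sparklens README.
    .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
    .config("spark.extraListeners",
            "com.qubole.sparklens.QuboleJobListener")
    .getOrCreate()
)
```

On Qubole these knobs are managed for you; the point of the sketch is how much manual guesswork they otherwise involve.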
Improve Usability for All Groups
Traditional data platforms are designed or optimized for a specific type of user, such as a data scientist or data engineer. By contrast, Qubole provides multiple interfaces built to optimize day-to-day processes for many user types, including data engineers, data analysts, data scientists, and administrators.
- Multiple interfaces: Qubole offers several interfaces (Notebooks, Analyze, and a REST API) for Spark users with different needs, whether building ETL pipelines, creating machine learning models, or launching Spark JARs.
- Workflow automation: With Qubole, you can schedule Spark jobs to run on a periodic basis, as well as leverage Airflow to build end-to-end data pipelines from multiple jobs (see the pipeline sketch after this list).
- Multi-user multi-workload Spark clusters: Qubole alleviates the need to create dedicated clusters for each user and workload, putting less stress on administrators and maximizing your cloud compute resources.
- Python and R package management: Out-of-the-box package management automatically distributes Python and R dependencies into the cluster when Notebooks are run.
- Spark Application UI: Users can track the state of the Spark application across the driver and all executors, which greatly simplifies and speeds up the debugging of problematic jobs.
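As a minimal illustration of the Airflow pattern, the sketch below chains two Spark jobs into a daily pipeline using Airflow’s stock BashOperator. The script paths and schedule are hypothetical; in a Qubole deployment you would point the tasks at your own jobs or Qubole’s Airflow integration:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="spark_etl_pipeline",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # run the pipeline once per day
) as dag:
    # Hypothetical job scripts, submitted via spark-submit.
    extract_transform = BashOperator(
        task_id="extract_transform",
        bash_command="spark-submit /jobs/extract_transform.py",
    )
    build_report = BashOperator(
        task_id="build_report",
        bash_command="spark-submit /jobs/build_report.py",
    )

    # The report job runs only after the ETL step succeeds.
    extract_transform >> build_report
```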
Spark Users Love Qubole
We saw customers’ Spark commands grow 439 percent between 2017 and 2018, which underscores the increasing value companies are achieving with Spark. This year, Qubole customers have run Spark clusters with more than 650 concurrent nodes and launched Spark clusters more than 300,000 times. And there’s a reason our customers love Spark on Qubole: with Qubole, they’re able to run some of the largest Spark clusters in the cloud without fear of job loss or out-of-control cloud costs.
See Spark on Qubole in action and learn more in our recorded webinar about accelerating Spark’s time to value with Qubole.