Each month, about an exabyte of data is processed using Qubole’s data platform on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, for a variety of use cases spanning data engineering pipelines, data science/machine learning, and advanced analytics. This not only requires the vast computing power of the cloud and the proven autoscaling and automation of Qubole, but it also demands a continuous flow of enhancements, fixes, and new features to support the ever-increasing data processing needs of our customers.
To that end, release 57 (R57) brings many new capabilities and enhancements that help simplify your data processing projects and improve their efficiency and performance. This blog post provides highlights within each category (administration, data engineering, and data science) and links to further details.
Administration
Infrastructure Administration
Release 57 brings improvements in administration productivity, cost control, and reliability of the cloud infrastructure:
- Qubole’s advanced cluster management and low-cost compute capabilities have enabled customers to operate clusters at high ratios of low-cost compute instances (i.e., AWS Spot instances, GCP preemptible VMs, or Azure low-cost VMs). To support these types of configurations reliably, Qubole now allows separate master and worker node configurations for a cluster. For example, you can now run the master on an On-Demand instance, while the workers run on 100% low-cost instances.
- R57 brings improved support for AWS EC2 Spot Blocks, which provide higher reliability than Spot nodes, albeit at a slightly higher cost. To minimize any adverse impact on workloads or queries, Qubole can now proactively and automatically replace Spot Blocks, before they expire, with new instances that have longer time blocks.
- When operating on a shared infrastructure, query performance varies with the demand on and health of that infrastructure. To enable better planning, R57 gives administrators live cluster health metrics and cleanup activity within the cluster details user interface, as well as in end-user applications like Workbench.
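The proactive Spot Block replacement described above can be pictured with a small sketch. This is only an illustration of the general idea (pick the nodes whose fixed-duration block is about to end so they can be swapped out ahead of time); it is not Qubole’s implementation, and the names `SpotBlockNode`, `nodes_to_replace`, and `safety_margin` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SpotBlockNode:
    """Hypothetical record for a worker running on a fixed-duration Spot Block."""
    node_id: str
    block_expires_at: datetime


def nodes_to_replace(nodes, now, safety_margin):
    """Return nodes whose block ends within `safety_margin` of `now`, soonest first."""
    deadline = now + safety_margin
    expiring = [n for n in nodes if n.block_expires_at <= deadline]
    return sorted(expiring, key=lambda n: n.block_expires_at)


# Illustrative data only: three workers with different remaining block times.
now = datetime(2020, 1, 1, 12, 0)
cluster = [
    SpotBlockNode("worker-1", now + timedelta(minutes=10)),
    SpotBlockNode("worker-2", now + timedelta(hours=3)),
    SpotBlockNode("worker-3", now + timedelta(minutes=25)),
]

# With a 30-minute margin, worker-1 and worker-3 are flagged for replacement.
for node in nodes_to_replace(cluster, now, timedelta(minutes=30)):
    print(node.node_id)
```

A larger margin gives running queries more time to move off a node before its block ends, at the cost of replacing instances earlier than strictly necessary.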
Security Administration
R57 provides data access controls, role-based access controls, and governance controls:
- New Apache Ranger integration with Apache Spark for row- and column-level data access controls, in addition to the existing Ranger support for Hive and Presto.
- Enhanced security for cluster orchestration through AWS PrivateLink and restrictive user privileges.
- Simplified user governance with directory services integration.
Data Engineering
R57 further simplifies data engineering tasks to support data restatements and achieve regulatory compliance:
- ACID transaction support for data lakes using Hive 3.1.1 (a project that we open-sourced). Read our technical blog and product blog for further details on reads using Presto and Spark, and on row-level updates and deletes that address GDPR / CCPA requirements for Right To Erasure (RTE) and Right To Be Forgotten (RTBF).
Data Science
Release 57 delivers new applications, new debugging tools, and incremental performance improvements:
- Integration of RStudio Pro with Spark clusters (ver. 2.2 and 2.3), in partnership with RStudio. This integration allows data scientists to use RStudio Pro hosted on the cluster master, which is accessible from the Resources Link section on the clusters page.
- Support for public and private installations of Enterprise GitHub and GitLab with Qubole Notebooks for development and Continuous Integration and Continuous Delivery (CI/CD).
- Qubole’s Spark tuning tool, SparkLens, is now available in open source and is also integrated into Workbench for profiling and optimizing Spark jobs. Read more here.
- Revised troubleshooting guides with updated tips, techniques to fix common issues, and new articles based on customer feedback. Read more here.
Cloud-specific Capabilities
Google Cloud Platform
- Custom Commit Plan on the GCP Marketplace, which allows you to purchase Qubole using private quotes, with customized terms of service and integrated billing with Google.
- Enhancements to Qubole’s support for Preemptible VMs, including the option to fall back to On-Demand instances, rebalancing, and automatic Preemptible node rotation for better loss-handling.
- Support for Presto 0.208 for data discovery and petabyte-level queries, with the full support of Qubole’s native workload-aware autoscaling capabilities. Presto is now available through Workbench or JDBC drivers.
- Encryption support for data at rest based on Google’s Cloud Key Management Service, plus in-flight encryption between cluster nodes.
- The Qubole Control Tier is now available in the European Union to address data locality needs.
Microsoft Azure
- Support for Azure Data Lake Storage Gen2 from Presto, Spark, and Hive, including access controls based on each user’s Azure Active Directory permissions.
- The Qubole Control Tier is now available in the European Union to address data locality needs.
Congratulations to the Engineering and Product teams that made R57 possible, and to our customers, who provided invaluable feedback during the testing phases.
To learn more about R57, please refer to the Release Notes in our product documentation, and let us know what you think via the “Send Feedback” button at the top right of the Qubole user interface.