Co-authored by Somya Kumar, Member of the Technical Staff at Qubole.
Presto is an open-source distributed SQL query engine developed at Facebook and built for interactive analytics queries. Like other Big Data processing engines such as Hive, Hadoop, and Apache Spark, Presto is offered as a service on Qubole Data Services (QDS) to query data stored in the cloud or on HDFS.
In this blog post, we will review the integration of the Presto Ruby client in QDS and compare two Presto query execution paths, one that uses the Presto Java client and one that uses the Presto Ruby client, including the performance benefits of the Ruby path.
Presto Ruby Client in QDS
QDS currently uses the Presto Java client packaged in Presto’s open-source distribution to interact with the Presto server. However, from our extensive experience working with the Presto Java client, we’ve learned that it can take up to 3 seconds just to invoke the JVM. This goes against the very nature of interactive and ad hoc queries, which usually have strict SLAs. To eliminate this client-side overhead, we explored other Presto clients and chose Ruby because it integrates well with our existing Ruby on Rails web application and provides the other benefits outlined below.
We’ve created a Ruby gem by extending the open-source Presto Ruby client developed by Treasure Data. We chose monkey patching for this integration so we could take advantage of any open source updates as well as have the ability to add functionality specific to our needs.
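To illustrate the general idea, here is a minimal sketch of one common way to layer site-specific behavior on top of a class owned by an upstream gem without editing the gem itself. The names `UpstreamClient` and `QueryLogging` are hypothetical stand-ins, not part of the Treasure Data client or our gem, and the sketch uses `Module#prepend` rather than reopening the class directly, which is the style used for the Faraday patch later in this post.

```ruby
# Hypothetical stand-in for a class shipped by an upstream gem.
# In a real project this class would live in the gem and stay untouched.
class UpstreamClient
  def run(query)
    "ran: #{query}"
  end
end

# Site-specific behavior lives in its own module (hypothetical name).
module QueryLogging
  def run(query)
    log << query  # extra behavior added by the patch
    super         # then fall through to the upstream implementation
  end

  def log
    @log ||= []
  end
end

# Prepending puts the module ahead of the class in method lookup, so the
# override wraps the upstream method instead of copying its body.
UpstreamClient.prepend(QueryLogging)

client = UpstreamClient.new
result = client.run("SHOW TABLES")
```

Because the patch calls `super` instead of duplicating upstream code, a newer release of the upstream class can change the body of `run` and the added behavior still applies, as long as the method signature stays the same.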
One of the main benefits of using QDS is auto-scaling: it adds and removes cluster nodes dynamically and automatically manages compute and storage resources to keep up with a Big Data application’s processing needs. To complement this functionality, we extended the Ruby client code to start a Presto cluster, process queries, get logs and results in real time from the Presto server, and update query statistics and session management information.
Furthermore, our implementation uses a SOCKS proxy to connect to the master node, and the Ruby client uses the Faraday gem to interact with the server. Note: Faraday is an HTTP client library that provides a common interface over many adapters such as Net::HTTP, which supports HTTP proxies but not SOCKS proxies. So we’ve extended the Faraday gem using monkey patching (see code below) to add support for the SOCKS proxy.
    require 'faraday'
    # Assumption: Net::HTTP::SOCKSProxy comes from the socksify gem.
    require 'socksify/http'

    module Faraday
      class Adapter
        class NetHttp < Faraday::Adapter
          # Pick the plain or proxied Net::HTTP class, then instantiate it
          # for the target host and port.
          def net_http_connection(env)
            if proxy = env[:request][:proxy]
              # support for socks proxy added here
              if proxy[:socks]
                Net::HTTP::SOCKSProxy(proxy[:uri].host, proxy[:uri].port)
              else
                Net::HTTP::Proxy(proxy[:uri].host, proxy[:uri].port, proxy[:user], proxy[:password])
              end
            else
              Net::HTTP
            end.new(env[:url].host, env[:url].port || (env[:url].scheme == 'https' ? 443 : 80))
          end
        end
      end
    end
Presto Query Execution Paths: Java vs Ruby
To elaborate on the differences between the two clients, let’s review their query execution paths.
The “Old” flow (see diagram below) shows how Presto queries are currently handled in QDS using the Presto Java client. When a query is submitted via the QDS web interface or through APIs, the task is assigned to one of the worker processes–which invokes a Presto client JVM that interacts with the Presto server to carry out the query execution.
Compare that to the “New” flow which uses the Presto Ruby client wherein the worker process interacts with the Ruby client–which then initiates a REST API call to execute the query on the Presto server.
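For readers unfamiliar with what that REST call looks like, the following is a rough sketch of Presto's HTTP statement protocol: the SQL text is POSTed to `/v1/statement`, and the client then follows each `nextUri` in the JSON responses until none is returned. This is not the QDS client. It uses Ruby's standard `Net::HTTP` instead of the Treasure Data gem and Faraday, the class name `PrestoStatementSketch` and the injectable `http:` parameter are hypothetical, and the `X-Presto-*` header names are assumed to match the Presto release in use.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Illustrative sketch only: submits one statement and collects its rows.
class PrestoStatementSketch
  # http: optional callable (verb, uri, headers, body) -> parsed JSON Hash.
  # It exists so the transport can be swapped out, for example in tests.
  def initialize(server:, user:, catalog:, schema:, http: nil)
    @server = server
    @headers = {
      'X-Presto-User' => user,
      'X-Presto-Catalog' => catalog,
      'X-Presto-Schema' => schema
    }
    @http = http || method(:default_http)
  end

  # POST the SQL text, then GET each nextUri until the server omits it.
  def run(sql)
    rows = []
    page = @http.call(:post, URI.join(@server, '/v1/statement'), @headers, sql)
    loop do
      raise page['error']['message'] if page['error']
      rows.concat(page['data']) if page['data']
      break unless page['nextUri']
      page = @http.call(:get, URI(page['nextUri']), @headers, nil)
    end
    rows
  end

  private

  # Default transport built on Net::HTTP from the standard library.
  def default_http(verb, uri, headers, body)
    request = verb == :post ? Net::HTTP::Post.new(uri, headers) : Net::HTTP::Get.new(uri, headers)
    request.body = body if body
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |conn|
      conn.request(request)
    end
    JSON.parse(response.body)
  end
end

# Usage (hypothetical host and credentials):
#   client = PrestoStatementSketch.new(server: 'http://presto.example:8080',
#                                      user: 'demo', catalog: 'hive', schema: 'default')
#   client.run('SHOW TABLES')
```

Since the whole exchange is a handful of HTTP requests made from an already running process, there is no per-query runtime startup cost on the client side.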
With this “New” architecture and flow that uses the Presto Ruby client, we’ve already seen the query execution lifecycle reduced by at least 3 to 4 seconds simply because the JVM invocation stage has been eliminated.
Another important benefit is greater scalability. With the Presto Java client, a cluster can process only as many queries as the number of Java clients it can start in parallel before hitting memory limitations. The Presto Ruby client, on the other hand, runs within QDS (see diagram above) and not on the cluster like the Java client, so it can scale out to keep up with Big Data processing workloads.
Performance Boost
As a result of these optimizations, when comparing our Presto Ruby client to the packaged Presto Java client, we’ve seen a noticeable performance boost for queries that normally execute in less than 10 seconds.
These are the scenarios we took into consideration:
- DDL Query. For example, SHOW TABLES
- Query with a syntax error
Here are the results:
Scenario | Presto Java client (wait time in secs) | Presto Ruby client (wait time in secs) | Improvement |
---|---|---|---|
DDL Query | 4 | 1 | Ruby client is 4x faster than Java client when processing DDL queries |
Query with a syntax error | 4 | 1 | Ruby client is 4x quicker than Java client in reporting back errors generated by the Presto server |
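As a small aside on how figures like these can be derived, the sketch below shows one way to measure client-side wait time and turn two measurements into a speedup ratio. The helper names `wait_time` and `speedup` are hypothetical and this is not the harness we used; it only makes the arithmetic behind the "4x" entries explicit.

```ruby
# Wall-clock seconds spent inside the given block, using a monotonic clock
# so the result is not affected by system clock adjustments.
def wait_time
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Ratio of the old wait time to the new one; both arguments are in seconds.
def speedup(old_wait, new_wait)
  old_wait.to_f / new_wait
end

# The table's figures: 4 seconds with the Java client, 1 second with Ruby.
ratio = speedup(4, 1)
```

In practice the block passed to `wait_time` would wrap a full submit-and-wait cycle through each client, repeated several times to smooth out noise.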
Summary
Using Presto Ruby client in QDS provides better overall performance and therefore makes data analysts running interactive analytics queries through the web interface more efficient. It also makes data engineering applications that use QDS APIs more performant because API commands go through the same execution path as the QDS web interface.
We will be rolling out the Presto Ruby client in QDS for all accounts in the coming weeks. If you’d like to get started today, please email us at [email protected] and we’ll enable it for your account.