Seven years ago, Joydeep Sen Sarma and Ashish Thusoo were first introduced to big data technology. Now, in 2014, they are the guiding force behind one of big data’s fastest-growing and most innovative companies — Qubole. By leveraging their expertise in big data technology along with the ever-expanding capabilities of the cloud, Sen Sarma and Thusoo have created a truly unique company.
Sen Sarma and Thusoo both got their start in big data at Facebook. Thusoo headed the data infrastructure team at the social media giant, and Sen Sarma started the Apache Hive project and made vital contributions to the Facebook Messages architecture. After Facebook, they teamed up to co-found Qubole.
Qubole’s co-founders operate on the same premise — big data in the cloud can drastically change and improve the world, whether through business solutions, healthcare solutions, or any number of other important decisions informed by big data.
A key component of the company’s success is the ease with which its services can be implemented. When a company uses Qubole’s services, it gets the best in big data technology and service without the hassle of maintenance and complicated software details. Qubole takes care of the dirty work on the road to Hadoop, allowing companies to keep their focus on solving business questions rather than running big data programs.
In the following Q&A, Sen Sarma and Thusoo open up about Qubole, its history and future, and how it’s making big data more accessible than ever.
Q & A with Joydeep Sen Sarma & Ashish Thusoo
Q: When did you first get involved with big data technology and how has your perception of the technology changed since then?
Sen Sarma: I first used Hadoop in 2007. I used to think it was primarily a Facebook/Yahoo/Google scale tool. I am continuously surprised by how ubiquitous large-scale data processing has become, and the level of innovation it has attracted as people invent faster and better tools for specific use cases. I am more bullish on this area (and cloud computing) than ever!
Thusoo: I first got involved in 2007. At the time, it was considered very novel, a technology in search of use cases. Now it is becoming more and more mainstream, though there is still a long way to go.
Q: What challenges do companies face when they are adopting Hadoop, particularly from a cloud service provider?
Sen Sarma: In general, companies suffer from a lack of expertise in implementing big data solutions. Qubole alleviates a big part of this challenge, namely, automating Hadoop operations and providing higher-level operators around data movement and analysis. This enables companies to get going with their big data projects much faster. However, the skills and expertise required to go beyond this and build an end-to-end solution for businesses are still in short supply.
The cloud offers a paradigm shift from the in-house data center, and this can be a surprise for some. For example, veterans may continue to define their architecture in terms of a fixed number of nodes, instead of thinking about an elastic architecture and its implications from the get-go.
Thusoo: Most companies lack operational expertise with Hadoop, and most cloud services are not truly self-managing. That is where true self-service products like Qubole can help. Other challenges include the lag in implementing Hadoop caused by hardware provisioning, the lack of tools on top of Hadoop, and the complexity of integrating technologies to make Hadoop useful to many more people in the enterprise. These, too, are readily addressed by a turnkey self-service product like Qubole.
Q: What has been the biggest challenge for Qubole from either a client or technology perspective?
Sen Sarma: One of the things that took us a while to realize was that even though Qubole has made Hadoop dramatically simpler to use, it still takes time for our clients to model their business problems and implement them on a new big data platform.
Technologically, one of our biggest challenges is navigating our way through the profusion of open-source projects in the big-data landscape. It turns out that this is one of the core value adds we provide to our customers. Instead of figuring out the best job scheduling technology (Oozie or Azkaban) or which of the fast SQL implementations (Presto, Impala, or Shark) to adopt, customers can rely on Qubole to evaluate all available alternatives and provide an implementation that works reliably at scale.
Thusoo: We are always motivated by the vision of building the best cloud-based big data service for our clients. And while it is a challenging goal and an audacious vision, we are enthralled by the prospect of how creating such a service and making big data infrastructure so accessible can transform companies and industries, and thus impact society in general.
Q: How does Qubole advise clients on security, data loss prevention, and recovery?
Sen Sarma: One of the starting points in the cloud, of course, is to use trusted and proven systems like AWS S3 (and equivalents like Google Cloud Storage) for reliable data storage. Every cloud provider also offers its own identity and access management systems, which Qubole integrates with to provide a highly secure solution for data access. At the extreme, there are organizations that must keep all data at rest encrypted in order to use public clouds (again, a requirement Qubole can help meet). We work with clients to understand the level of security required by their domain and the right set of technologies that can help them achieve it.
Q: What was the original purpose for creating Apache Hive?
Sen Sarma: From a personal perspective, I wrote some fairly complex Java MapReduce jobs when I first started using Hadoop. It became clear to me very quickly that it would not be possible to teach our fast-growing engineering and analyst teams this skill set so they could exploit Hadoop across the organization. At the same time, we abhorred a setup where a small set of data pros would have to be called upon every time data in Hadoop needed to be analyzed.
We knew that SQL was an interface widely used by engineers (for developing applications) and analysts (for analysis) that would require little training. While SQL is powerful enough for most analytics requirements, we also loved the programmability of Hadoop and wanted the ability to surface it. Apache Hive was born out of these dual goals — an SQL-based declarative language that also allowed engineers to plug in their own scripts and programs when SQL did not suffice. It was also built to store centralized metadata about all the (Hadoop-based) data sets in an organization, which was indispensable to creating a data-driven organization.
Thusoo: We wanted to bring the power of Hadoop to more and more people, not just developers. Since SQL was the lingua franca for all things data, taught heavily in school and used heavily in industry, putting SQL on top of Hadoop made Hadoop more accessible to data users.
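To make those dual goals concrete, here is a minimal HiveQL sketch of the pattern the co-founders describe: plain declarative SQL for routine analysis, with Hive's TRANSFORM clause as the escape hatch into custom code. The table name (page_views) and script (sessionize.py) are hypothetical stand-ins, not examples from Facebook or Qubole.

```sql
-- Plain declarative SQL: daily view counts per page, no MapReduce code needed.
SELECT page_id, COUNT(*) AS views
FROM page_views
WHERE dt = '2014-01-01'
GROUP BY page_id;

-- Surfacing Hadoop's programmability: stream rows through a user-supplied
-- script for logic (here, sessionization) that is awkward to express in SQL.
ADD FILE sessionize.py;

SELECT TRANSFORM (user_id, view_time, page_id)
       USING 'python sessionize.py'
       AS (user_id, session_id, page_id)
FROM (
  SELECT user_id, view_time, page_id
  FROM page_views
  WHERE dt = '2014-01-01'
  DISTRIBUTE BY user_id       -- group each user's rows onto one reducer
  SORT BY user_id, view_time  -- so the script sees them in time order
) clustered_views;
```

The script simply reads tab-separated rows on stdin and writes tab-separated rows on stdout — the streaming contract Hive inherits from Hadoop — which is what let engineers plug arbitrary programs into otherwise ordinary SQL queries.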