The increase in the volume, velocity, and variety of data, combined with new analytics and machine learning workloads, has created the need for an open data lake architecture. The open data lake has become a standard fixture alongside the data warehouse. While the data warehouse has been designed and optimized for SQL analytics, the need for an open, simple, and secure platform that can support new types of analytics and machine learning has driven open data lake adoption. At the same time, enterprises today are considering the convergence of the data lake and data warehouse models.
Debanjan Saha, VP and GM of Data Analytics services in Google Cloud, including BigQuery, Dataflow, Pub/Sub, Dataproc, Data Fusion, Composer, Data Catalog, and more, talks about the convergence model and how to bridge the performance gap while preserving the openness of the data lake architecture.
What are your views on the data lake? How does it compare and contrast with data warehousing?
People store data in various places: warehouses, object storage, databases, on-premises, the cloud, or multiple clouds. But they want to bring all of this data under one management umbrella and essentially view it as one pool of data on which they can run different types of analysis and machine learning models. And I believe that is the right way of thinking about a data lake: your data could physically be anywhere, but you have one place or system through which you can manage the pooled data assets.
Back in the 1990s and early 2000s, SQL analytics was in vogue, but today we see a diversity of analytics that is also part of the data lake system. How do you view that diversity in terms of how this data is consumed? And what types of analytics can you run on top of it?
Many people think of the data lake primarily as a storage play on which they can use various types of analytics or machine learning models. But the way I think about it is that you have all your data assets under one management, and on those data assets you can run various types of analytics or ML engines like Presto, BigQuery, or Spark. And that makes the data lake more flexible and useful in many cases because it is a multi-purpose system. A data lake is very different from a data warehouse, where the storage and analytics engine are tied together as one appliance.
The data warehousing industry has been around for a long time, and it is also evolving with the plurality of analytics that we see today. On one side there is the data lake paradigm, where you store data in an open format and plug in different types of analytics engines. On the other side, the existing data warehouse paradigm couples the storage format with the processing engine. There is evolution on both sides, but how do you see it playing out?
The data warehouse vendors are gradually moving from their existing model toward a converged data warehouse and data lake model. Similarly, the vendors who started their journey on the data lake side are now expanding into the data warehouse space. Convergence is happening from both sides. For instance, BigQuery is now letting organizations query data on Amazon S3.
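As a hedged illustration of that capability, the sketch below defines a BigQuery external table over Parquet files in S3 and queries it with standard SQL. The connection, dataset, table, and bucket names are hypothetical placeholders, and it assumes an AWS connection resource has already been created in the matching region.

```python
# Minimal sketch: query S3-resident data from BigQuery.
# All resource names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Define an external table over Parquet files in S3, using a
# pre-created AWS connection; the dataset is assumed to live in
# the connection's AWS region.
ddl = """
CREATE EXTERNAL TABLE demo_dataset.sales_external
WITH CONNECTION `aws-us-east-1.demo-aws-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://demo-bucket/sales/*']
)
"""
client.query(ddl).result()

# Query it like any other table; iterating the job waits for completion.
rows = client.query(
    "SELECT region, SUM(amount) AS total "
    "FROM demo_dataset.sales_external GROUP BY region"
)
for row in rows:
    print(row.region, row.total)
```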
Similarly, data lake platforms like Databricks and Qubole are now decisively moving toward data warehousing use cases. You can have managed storage with ACID properties, transactional consistency, snapshots, and so on, integrate the query engine more tightly with storage management, and create a lakehouse paradigm for customers. We are doing the same thing with Dataproc: for example, with Dataproc you can run SQL engines and use data in BigQuery. Convergence between the data lake and the data warehouse is not just talk; it is happening in practice.
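Going the other way, here is a minimal sketch of the Dataproc scenario: Spark reading a BigQuery table through the spark-bigquery connector and running SQL over it from the lake side. The project, dataset, and table names are hypothetical, and the connector jar is assumed to be available on the cluster.

```python
# Minimal sketch: Spark on Dataproc querying warehouse-resident data.
# Assumes the spark-bigquery connector is on the cluster's classpath;
# the table name is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-warehouse-convergence").getOrCreate()

# Load a BigQuery table as a Spark DataFrame.
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales.orders")
    .load()
)

# Run Spark SQL over the warehouse table from the data lake engine.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region").show()
```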
What are some of the key things that you think will transpire in the data lake or data warehouse architecture as they converge? Are governance and security still open issues?
Well, governance is a big open issue, but there are others, like the search and discoverability of data. For example, there is a lot of data stored in data lakes that people are not aware of, and the ability to discover and search that data is something many of our customers are interested in.
Once data is discoverable and searchable, you want to make sure that you grant the right access to the people who wish to use it. In data lake architectures, governance is done at a coarse granularity, which is not enough for sophisticated applications that need fine-grained, row-level or column-level access control, something many of our customers would like to have.
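As one hedged example of what fine-grained control looks like in a warehouse today, the sketch below creates a BigQuery row access policy so that only members of a given group can see matching rows. The dataset, table, policy, and group names are hypothetical placeholders.

```python
# Minimal sketch: row-level access control in BigQuery.
# Dataset, table, policy, and group names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Only members of the named group can read rows where region = 'US';
# other principals see no matching rows.
client.query("""
CREATE ROW ACCESS POLICY us_sales_only
ON demo_dataset.orders
GRANT TO ('group:us-sales@example.com')
FILTER USING (region = 'US')
""").result()
```

Column-level control in BigQuery is handled separately, through policy tags attached to individual columns.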
Another area that I think will be important is bringing all the data assets under one management. We are focusing heavily on a multi-cloud strategy. And I think at some point people are going to be more interested in the performance aspect. The performance of data warehouses, especially for SQL queries, is often much better than that of SQL queries in data lake environments. That is because of the tight management of storage and the various layers of caching in data warehouses, and people will demand the same thing for data lakes. Tight management of storage, from both a transactional and a performance perspective, will be important going forward.
How do you bridge the performance gap while still adhering to the open data formats that a data lake offers?
It’s more than just a data format. I think it is the tight integration of storage and the query engine, along with one or more layers of caching, that is important to match data warehouse performance. With respect to storage formats, as long as you can import and export data in an open format, that is a good starting point. For example, BigQuery has its own storage format, which allows us to deliver query performance and access control.
We enable customers to import and export data in open formats, which keeps the movement of data between the data warehouse and the data lake simple. Over time, open formats like Parquet can probably improve to give people the fine-grained access control that is native to the BigQuery format. We are very open to this and are also working with partners to develop a new data format that gives people that opportunity. Apache Arrow could be an excellent intermediate layer between data persisted in object stores and other storage systems and the query engines that process it, with some level of in-memory processing.
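To make the import/export point concrete, here is a minimal sketch under assumed names: it exports a BigQuery table to Parquet, an open format, on Cloud Storage, then pulls the same table into memory as an Apache Arrow table, the kind of intermediate layer described above. The project, dataset, table, and bucket names are hypothetical placeholders.

```python
# Minimal sketch: moving data between the warehouse's native storage
# and open formats. Resource names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.demo_dataset.orders"

# Export the table to Parquet files on Cloud Storage.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET
)
client.extract_table(
    table_id,
    "gs://demo-bucket/exports/orders-*.parquet",
    job_config=extract_config,
).result()

# Read the table's rows into an in-memory Apache Arrow table
# (requires the pyarrow package).
arrow_table = client.list_rows(table_id).to_arrow()
print(arrow_table.schema)
```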
Should a data lake today be developed on the cloud or should it be an independent construct? And what are the key drivers of multi-cloud or hybrid cloud?
In terms of direction, it is very clear that things are moving to the cloud. If you build a data lake that is not integrated with the cloud, you are building a data lake for on-premises only, which will not be successful. However, not everything will move to the cloud in a day or a year; it is going to take time, and it is essential to have a strategy that transcends the cloud and includes on-premises. Otherwise, you end up with a limited set of use cases where people have to move everything to the cloud before running their analytics. I don’t think we want to be in that state.
There are various reasons people use multiple clouds: regulatory requirements, different departments within an organization running applications in different clouds, or assets acquired in other clouds through M&A. Sometimes different sets of capabilities are available in each cloud. It is evident at this point that every customer you talk to has a footprint in every cloud, and it is important to have a data lake that can handle data across different clouds and on-premises.
How should the data lake construct evolve while keeping the data lake flexible and simple? What are your thoughts on that?
There are different dimensions to it. I don’t think simplicity and flexibility are mutually exclusive. It is imperative to understand domain-specific use cases and encode them on flexible infrastructure, so that people don’t have to wade through multiple options and can build their domain-specific data lake quickly, improving their time to value, which is very important in the post-COVID-19 era. We should create flexible templates that help organizations start their journey at 70 percent maturity rather than zero. This is something we are very interested in, and we are working towards it.
Would you like to share your parting comments about data lakes?
Well, I think the world is moving to the open data lake, and the convergence of data warehouses and data lakes will happen. I believe that three to five years from now, we will see people pulling all of their data assets together under single management across multiple clouds and on-premises. And the pandemic is accelerating the journey toward the cloud and the data lake.