The history of the software industry is littered with examples of data management companies that lock customer data into their proprietary systems, and then go on to tax those very customers, the real owners of the data, for any type of data usage.
The newest and most brazen member of this category is Snowflake. The Snowflake Data Warehouse is a classic proprietary system, designed to levy a “tax” on customers once their data is locked in. We have seen this playbook before from Teradata and IBM, and it resulted in crippling lock-in and unsustainable costs for customers.
The Snowflake tax works in three ways:
- Proprietary storage: Snowflake stores data in a proprietary format, making it cumbersome to use, especially for non-SQL workloads that are not supported natively. This means you have to pay Snowflake a data import/export tax if you want to use open-source data processing frameworks for non-SQL workloads such as streaming analytics and machine learning.
- Fixed workflow: Snowflake forces SQL-based data processing on all workloads. While SQL may be the lingua franca of BI, machine learning and streaming analytics require programmatic distributed data processing frameworks such as Apache Spark and TensorFlow, and languages such as Python, Java, and Scala, the preferred choices of data engineers and data scientists. For example, to run an Apache Spark job on Snowflake data, you have to use the Snowflake Spark connector or JDBC driver to query the SQL engine, import the result into a DataFrame, process it with Apache Spark, and write it back into Snowflake (see the sketch after this list). Not only is this roundabout process complex and inefficient, but merely reading the data out of Snowflake incurs additional costs: the Snowflake tax again.
- ETL and switching costs: Snowflake wants to store all your data, which means you have to move data, even data that already sits in your cloud object store, into the Snowflake repository. While data movement and sequential ETL make sense for Business Intelligence (BI) workloads, machine learning and streaming analytics call for just-in-time, iterative data engineering. But the largest Snowflake tax is levied when you want to switch out of Snowflake: a bulk export of big data that comes with its own billing overhead. It reminds me of my favorite tune: “You can check out any time you like, but you can never leave.”
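To make the round trip in the second bullet concrete, here is a minimal PySpark sketch. It assumes the Snowflake Spark connector is available on the cluster; the account URL, credentials, warehouse, and the events table with its event_date column are hypothetical placeholders, not details taken from Snowflake's documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-roundtrip").getOrCreate()

# Hypothetical connection options for the Snowflake Spark connector.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

# Step 1 (export tax): the connector pushes a query to Snowflake's SQL
# engine and materializes the result as a Spark DataFrame -- a metered read.
df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("query", "SELECT * FROM events")
      .load())

# Step 2: process the data with Spark, outside the warehouse.
daily_counts = df.groupBy("event_date").count()

# Step 3 (import tax): write the result back into Snowflake -- a metered write.
(daily_counts.write
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "DAILY_EVENT_COUNTS")
      .mode("overwrite")
      .save())
```

Every Spark job pays this read-process-write-back toll, because the data lives only inside Snowflake.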
Much has been said about data lakes replacing data warehouses, or vice versa, but I view them as complementary. At Qubole, we have worked with more than 300 market-leading companies, including Expedia, Disney, Adobe, and Epic Games, and we see the same architecture everywhere: a data warehouse for BI workloads, and a data lake for data engineering-heavy workloads such as machine learning and streaming analytics. The same architecture exists at the most advanced and data-driven companies of our time, including Google, Facebook, Netflix, and Amazon.
Data warehouses such as Snowflake enable fast, complex queries across historical structured data. They help businesses learn from the past. For example, a retailer can understand which products have sold well in a particular region. The performance and ease-of-use benefits justify the data warehouse tax for this type of workload.
Data lakes, on the other hand, focus on workloads such as data engineering, streaming analytics, and machine learning over unstructured and semi-structured data: think video, social streams, and IoT sensor data. Qubole’s open data lake platform natively supports the technologies data scientists use for machine learning today, including programming languages such as Python, Java, Scala, and R, and open-source frameworks such as Apache Spark and TensorFlow.
Data lakes store data in open formats, such as Apache Parquet and ORC, and make it accessible through open, standards-based interfaces. That openness is what prevents vendor lock-in: any engine that speaks the format can read the data in place.
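Contrast the Snowflake round trip above with the same job on an open data lake, where Spark reads the files in place from the object store. A minimal sketch, assuming hypothetical Parquet data under an s3:// path with the same event_date column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-data-lake").getOrCreate()

# Read open-format (Parquet) data in place from the cloud object store:
# no proprietary import step, no metered export.
df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path

daily_counts = df.groupBy("event_date").count()

# Results land back in the same open format, readable by any
# Parquet-capable engine.
daily_counts.write.mode("overwrite").parquet("s3://my-bucket/daily_event_counts/")
```

The same files remain queryable by any other Parquet-capable tool, so there is no export tax when you change engines or vendors.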
Don’t put all your data into a data warehouse and repeat the mistake of vendor lock-in. Build a modern data platform that combines the power of a data warehouse and a data lake.