Data Lake Essentials, Part 3 – Data Lake Data Catalog, Metadata, and Search
In this multi-part series, we take you through the architecture of a data lake, exploring it across three dimensions:
- Part I – Storage and Data Processing
- Introduction
- Physical Storage
- Data Processing ETL/ELT
- Part II – File Formats, Compression, and Security
- File Formats and Data Compression
- Design Security
- Part III – Data Catalog and Data Mining
- Data Catalog, Metadata, and Search
- Access and Mine the Lake
In this edition, we look at the data catalog, metadata, and search, as well as how to access and mine the lake.
Key Considerations
Any data lake design should incorporate a metadata storage strategy to enable business users to search, locate, and learn about the datasets that are available in the lake. While traditional data warehousing stores a fixed and static set of meaningful data definitions and characteristics within the relational storage layer, data lake storage is intended to flexibly support the application of schema at read time. However, this means that a separate storage layer is required to house cataloging metadata that captures technical and business meaning. Organizations sometimes simply accumulate content in a data lake without a metadata layer, but this is a recipe for an unmanageable data swamp instead of a useful data lake.
There is a wide range of approaches and solutions for ensuring that appropriate metadata is created and maintained. Here are some important principles and patterns to keep in mind. Note that a single dataset can have multiple metadata layers depending on the use case (e.g. the Hive Metastore, the AWS Glue Data Catalog), and the same data can be exported to a NoSQL database under a different schema.
Enforce a metadata requirement
The best way to ensure that appropriate metadata is created is to enforce its creation. Make sure that every method through which data arrives in the core data lake layer enforces the metadata creation requirement, and that any new data ingestion routine specifies how that requirement will be enforced.
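As a minimal sketch (not a feature of any particular platform), an ingestion entry point can simply refuse to land data in the core lake layer unless the required metadata accompanies it. The field names below are hypothetical:

```python
# Minimal, hypothetical enforcement check at the ingestion entry point.
# The required field names are illustrative, not a standard.
REQUIRED_METADATA = {"dataset_name", "owner", "source_system", "schema_version"}

def validate_ingestion_metadata(metadata: dict) -> None:
    """Raise if the metadata accompanying an ingestion request is incomplete."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"Ingestion rejected; missing metadata: {sorted(missing)}")

# This request would be rejected because 'owner' and 'schema_version' are absent.
try:
    validate_ingestion_metadata({"dataset_name": "clickstream_raw", "source_system": "web"})
except ValueError as err:
    print(err)
```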
Automate metadata creation
As with nearly everything in the cloud, automation is the key to consistency and accuracy. Wherever possible, design for metadata to be created automatically, extracted from the source material itself.
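For example, much of the technical metadata (column names, types, row counts) can be derived directly from files as they land. A sketch in PySpark, assuming a hypothetical CSV drop location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("auto-metadata").getOrCreate()

# Hypothetical landing path; in practice the ingestion pipeline supplies it.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-lake/raw/clickstream/2024-01-01/"))

# Technical metadata extracted automatically from the source material.
technical_metadata = {
    "columns": [(f.name, f.dataType.simpleString()) for f in df.schema.fields],
    "row_count": df.count(),
}
print(technical_metadata)
```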
Prioritize cloud-native solutions
Wherever possible, use cloud-native automation frameworks to capture, store and access metadata within your data lake.
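On AWS, for instance, the Glue Data Catalog can play this role. A hedged sketch using boto3, with a hypothetical database, table, and S3 location:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a table over Parquet files in the lake (names and paths are hypothetical).
glue.create_table(
    DatabaseName="lake_raw",
    TableInput={
        "Name": "clickstream",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id", "Type": "string"},
                {"Name": "event_ts", "Type": "timestamp"},
            ],
            "Location": "s3://my-lake/raw/clickstream/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```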
Metadata Searching
A data catalog such as Alation is one example of a solution that allows searching against this metadata – e.g. "Which is the hottest (most heavily queried) table in the store?"
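Alation's own interface is not shown here, but the same idea can be illustrated against a catalog you may already have: a keyword search over the AWS Glue Data Catalog via boto3 (search text and region are hypothetical):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Keyword search across table names, descriptions, and properties in the catalog.
response = glue.search_tables(SearchText="clickstream", MaxResults=10)
for table in response["TableList"]:
    print(table["DatabaseName"], table["Name"])
```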
Data Lake – Accessing and Mining the Lake
Schema on Read
‘Schema on write’ is a tried and tested pattern of cleansing, transforming, and adding a logical schema to the data before it is stored in a ‘structured’ relational database. However, as noted previously, data lakes are built on a completely different pattern of ‘schema on read’ that prevents the primary data store from being locked into a predetermined schema. Data is stored in a raw or only mildly processed format, and each analysis tool can impose on the dataset a business meaning that is appropriate to the analysis context. There are many benefits to this approach, including enabling various tools to access the data for various purposes.
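A small PySpark sketch of schema on read, assuming hypothetical raw JSON events: the same files are read twice, each time with only the schema that the analysis at hand requires.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

raw_path = "s3://my-lake/raw/events/"  # hypothetical location; the files stay raw

# Analysis A only needs who did what, and when.
schema_a = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])
events_a = spark.read.schema(schema_a).json(raw_path)

# Analysis B reads the very same files but keeps only device information.
schema_b = StructType([
    StructField("device_id", StringType()),
    StructField("os_version", StringType()),
])
events_b = spark.read.schema(schema_b).json(raw_path)
```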
Data Processing
Once you have the raw layer of immutable data in the lake, you will need to create multiple layers of processed data to enable various use cases in the organization. These are examples of the structured storage described earlier in this blog series. Typical operations required to create these structured data stores involve:
- Combining different datasets (i.e. joins)
- Denormalization
- Cleansing, deduplication, householding
- Deriving computed data fields
Apache Spark has become the leading tool of choice for processing raw data to create various value-added, structured data layers.
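A hedged PySpark sketch of the operations listed above, with hypothetical dataset names and columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curated-layer").getOrCreate()

# Hypothetical raw inputs landed earlier in the immutable layer.
orders = spark.read.parquet("s3://my-lake/raw/orders/")
customers = spark.read.parquet("s3://my-lake/raw/customers/")

curated = (
    orders
    .dropDuplicates(["order_id"])                    # deduplication
    .join(customers, on="customer_id", how="left")   # combining datasets / denormalization
    .withColumn("order_total",                       # derived, computed field
                F.col("quantity") * F.col("unit_price"))
)

# Persist a structured, query-ready layer back to the lake.
curated.write.mode("overwrite").parquet("s3://my-lake/curated/orders_enriched/")
```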
Data Warehousing
For some specialized use cases (think high-performance data warehouses), you may need to run SQL queries on petabytes of data and return complex analytical results very quickly. In those cases, you may need to ingest a portion of your data from your lake into a column store platform. Examples of tools to accomplish this would be Google BigQuery, Amazon Redshift, or Azure SQL Data Warehouse.
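One common pattern is to push a curated slice from the lake into the warehouse. The sketch below uses a generic Spark JDBC write with a hypothetical Redshift endpoint and credentials; for large volumes you would normally use the warehouse's bulk-load path (e.g. Redshift COPY or BigQuery load jobs) instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-to-warehouse").getOrCreate()

# Hypothetical curated dataset produced in the processing step above.
curated = spark.read.parquet("s3://my-lake/curated/orders_enriched/")

(curated.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")  # hypothetical endpoint
    .option("dbtable", "public.orders_enriched")
    .option("user", "loader")
    .option("password", "<redacted>")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("overwrite")
    .save())
```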
Interactive Query and Reporting
There are still a large number of use cases that require support for regular SQL query tools to analyze these massive data stores. Apache Hive, Presto, Amazon Athena, and Impala are all specifically developed to support these use cases by creating or utilizing a SQL-friendly schema on top of the raw data.
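As an illustration with Amazon Athena, a SQL-friendly external table can be declared directly over raw Parquet files in the lake; the database, bucket, and columns below are hypothetical:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Declare a SQL-friendly schema over raw Parquet files already sitting in the lake.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS lake_raw.events (
  user_id    string,
  event_type string,
  event_ts   timestamp
)
STORED AS PARQUET
LOCATION 's3://my-lake/raw/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "lake_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-lake/athena-results/"},
)
```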
Data Exploration and Machine Learning
Finally, among the biggest beneficiaries of the data lake are your data scientists, who now have access to enterprise-wide data, unfettered by predetermined schemas, and who can explore and mine that data for high-value business insights. Many data science tools are either based on, or can work alongside, Hadoop-based platforms that access the data lake.
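A hedged sketch of that workflow: a data scientist pulls a curated slice from the lake into pandas and fits a simple scikit-learn model. Paths, feature columns, and the label are hypothetical, and reading Parquet straight from S3 assumes s3fs and pyarrow are installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hypothetical curated slice read straight from the lake.
df = pd.read_parquet("s3://my-lake/curated/orders_enriched/")

features = df[["order_total", "quantity"]]   # illustrative feature columns
label = df["churned"]                        # illustrative label column

X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```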
Data Lake Platform – features to look for
Your Data Lake Platform Should Offer:
- Multiple data processing engine options, such as Spark, Hadoop/Hive, Presto, etc. This is essential for supporting a wide array of use cases.
- A metastore anchored on open standards, such as the Hive Metastore, which can then be used from Hive, Presto, and Spark SQL
- Catalog integration with AWS Glue.
- Support for AIR (Alerts, Insights, and Recommendations) that can be used to surface useful information from the metadata
- Support for the Kafka Schema Registry (for streaming datasets).
- Connectors to Data Warehousing solutions such as Snowflake, Redshift, BigQuery, Azure SQL Database, etc.
- Connectors for popular databases such as MySQL, MongoDB, Vertica, SQL Server, etc.
- Serverless computing options (e.g. Presto) to cost-effectively meet interactive query requirements.
- A unified browser-based UI for analysts to run their queries.
- JDBC/ODBC drivers to query from BI tools like Tableau, Looker, QlikView, Superset, Redash, etc.
- Jupyter/Zeppelin notebooks for data scientists and analysts.
- UI-based data science package management for Python and R.
Missed Part 2? Data Lake Essentials, Part 2 – File Formats, Compression, and Security
Additional References
Blogs
- AIR: Data Intelligence in Qubole
- Qubole Now Supports Glue Data Catalog to Run ETL, ML, and Analytics Jobs
- Package Management, or: How I Learned to Stop Worrying and Love Dependencies
Developer Resources
- Connecting to a Custom Hive Metastore (Qubole Product Docs)
- The Qubole Analyze User Interface (Qubole Product Docs)
- Qubole Partner Drivers (Qubole Product Docs)
- Qubole Notebooks (Qubole Product Docs)
- Using Qubole Package Management (Qubole Product Docs)
- Introduction to Qubole Streaming Analytics (Qubole Product Docs)
Conclusion
In this blog, we've covered the major components of the data lake architecture, along with Qubole's solutions for each. We encourage you to continue your journey with a Qubole test drive!