Every data source — from the sensors in cars and machinery to the devices we carry around in our pockets — constantly records and transmits data amounting to 2.5 quintillion bytes per day. With the influx of information, enterprises are scrambling to capture all of the right data points, organize and process that data, and convert it into actionable insights. Yet in the hustle to transform raw data into usable information, we tend to forget how sophisticated and challenging the process between the input and output has become.
We’ve seen the infrastructure to process data become far more complex, in part due to the seismic shift in the diversity of data that businesses collect today and the variety of analyses to which data is subjected. Data lakes now coexist with traditional data warehouses — the result of relatively simple methods of decades past morphing into an elaborate process involving many different components, all of which interact with one another.
This patchwork infrastructure creates its own set of challenges. While self-service access to data stored in data warehouses has become a reality with the evolution of business intelligence platforms, it remains challenging for enterprises to deliver the same kind of access to data stored in data lakes. Today, providing self-service access to a data lake means mastering open source software, data operations, infrastructure management, and security, as well as supporting the variety of ways different data users within the enterprise access data.
Business leaders and data teams now need to consider many moving parts when giving users self-service access to data lakes, and to think about their data processes from a holistic perspective. Neglecting any of the puzzle pieces can have a lasting effect far beyond the current and future state of your infrastructure — it can impact your company’s overall productivity and your bottom line.
Resolving Data Operations Challenges
Data teams today face an endless flood of data sources and are finding it impossible to keep up with the operational demands of such an extensive data intake. The proliferation of structured, unstructured, and semi-structured data formats requires teams to continuously assess data quality prior to publishing data sets for enterprise data consumers. Teams expend significant time and resources preparing and ingesting data generated from sources within the company’s control as well as data provided by third-party vendors.
Not only is the quality of data critical, but delivering that data in a timely manner to analysts, data scientists, and software development teams is equally important. Data delayed is data denied. As a result, data teams must be able to measure SLAs and incorporate early fault detection, fault remediation, and high-availability practices into data operations. Furthermore, any industrial-grade data operations practice must involve tracking and managing the costs of operating the data lake infrastructure as well as driving visibility into how costs are allocated across the organization and different data projects.
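As a concrete illustration, here is a minimal sketch of how a team might encode an SLA and basic fault remediation for a daily ingestion job in Apache Airflow (one of the tools discussed later in this post). The DAG name, task, schedule, and timing thresholds are all hypothetical.

```python
# Hypothetical sketch: an SLA plus retry-based fault remediation for a daily
# ingestion task in Apache Airflow. Names and thresholds are invented.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_vendor_feed():
    # Placeholder for the real ingestion and data-quality checks.
    pass


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Early fault detection: alert the on-call data engineer when the SLA slips.
    print(f"SLA missed for tasks: {task_list}")


with DAG(
    dag_id="vendor_feed_ingestion",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=notify_sla_miss,
    default_args={
        "retries": 2,                    # automatic fault remediation
        "retry_delay": timedelta(minutes=10),
    },
    catchup=False,
) as dag:
    PythonOperator(
        task_id="ingest_vendor_feed",
        python_callable=ingest_vendor_feed,
        sla=timedelta(hours=2),          # data must land within two hours
    )
```

In practice, the SLA callback would page someone or open a ticket, and cost tracking would live in separate tooling; the point is simply that these operational guarantees have to be designed in, not bolted on.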
Simplifying Infrastructure Management
Unlike in previous decades, when data storage and data processing were tightly coupled, the emergence of standard formats has decoupled storage from specific analytical processing engines. Different analytical engines can now process the same data set without having to replicate the data into separate repositories for each type of processing.
For instance, many companies now leverage more than one of the leading big data engines or frameworks: Presto, Apache Spark, Hive, Hadoop, TensorFlow. Qubole’s recently published Big Data Activation Report uncovered 162 percent growth in usage across Spark, Presto, and Hadoop/Hive in the span of one year (from 2017 to 2018). What’s noteworthy is that this growth was not driven by a single engine: Presto usage grew by 420 percent, Spark by 298 percent, and the remainder stemmed from growth in Hadoop/Hive usage. Data teams use these technologies to conduct different types of analysis on the same data set stored in Parquet, ORC, JSON, or other open formats. While this has simplified data pipelines, since data no longer needs to be replicated into many different monolithic systems (data warehouses, graph databases, machine learning systems, etc.), it has also made managing the data infrastructure much more complex.
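To make the decoupling concrete, the sketch below uses PySpark to write a Parquet data set to shared storage and register it as an external table; the same files could then be queried from Presto or Hive through a shared metastore without copying the data. The storage path, database, and table names are invented for illustration, and the sketch assumes S3 connectivity and a metastore the engines can share.

```python
# Hypothetical sketch: one Parquet data set, multiple engines.
# The S3 path, database, and table name are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-parquet-example").getOrCreate()

events_path = "s3a://example-data-lake/events/"  # shared, open-format storage

# Spark writes and reads the open Parquet format directly.
(spark.range(1000)
      .withColumnRenamed("id", "event_id")
      .write.mode("overwrite")
      .parquet(events_path))
print(f"Rows visible to Spark: {spark.read.parquet(events_path).count()}")

# Registering the same files as an external table (in a metastore the engines
# share, e.g. Hive) lets Presto or Hive query identical data without copying it.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING PARQUET
    LOCATION '{events_path}'
""")
# From Presto, an analyst could then run something like:
#   SELECT count(*) FROM hive.analytics.events;
```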
Data teams are also experiencing increasingly diverse use cases driven by an ever-increasing number of users within the enterprise seeking access to a growing variety of data. This scenario places an onus on the infrastructure to continue to evolve to meet these growing and changing requirements. Predicting data demands on the infrastructure has become impossible, which leaves the data team playing a never-ending game of catch-up and missing SLAs.
Securing Enterprise Data
Data security remains difficult for enterprises to execute successfully, a problem compounded by the breadth of protection that today’s vast volumes of data require. Regardless of how sophisticated their security policies and controls are, organizations not in the big data business will struggle to keep pace with the costs and resources that a comprehensive security strategy demands. The proliferation of security tools and frameworks alone calls for dedicated staff and the resources to provide constant security testing, patching, penetration testing, and vulnerability assessments.
Extensive oversight of access controls is also crucial for identifying security vulnerabilities and preventing breaches. Your data administrators need the means to control access across all processing engines and frameworks not only to ensure privacy but also to conduct auditing and guarantee compliance. Encryption is yet another essential component of your security infrastructure to help safeguard the privacy of your data assets at rest and in transit. And with so many tools and technologies living in the cloud, it’s vital that your team employ and maintain appropriate encryption methodologies to protect against security infiltrations.
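As one small illustration of what “appropriate encryption methodologies” can look like in practice, encryption at rest for an S3-backed data lake can be enforced with a default bucket encryption rule. The sketch below assumes the boto3 library, AWS credentials in the environment, and an invented bucket name and key alias.

```python
# Hypothetical sketch: enforcing encryption at rest on an S3-backed data lake
# bucket with a default encryption rule. Bucket name and key alias are invented.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="example-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/data-lake-key",  # hypothetical key alias
                }
            }
        ]
    },
)
# Encryption in transit is handled separately, for example by requiring TLS
# (aws:SecureTransport) in the bucket policy.
```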
Supporting End-User Tools and Interfaces
Modern enterprises depend on many tools and technologies, each with its own interfaces. The problem is compounded by the fact that different tools are designed or optimized for specific users, whether data scientists, data engineers, data analysts, or administrators. Data scientists are concerned with having tools that enable machine learning, provide end-to-end visibility of the data pipeline, and automate the mundane tasks of scaling the data infrastructure. Data engineers require tools to optimize data processing tasks and streamline the production of data pipelines, while data analysts need a way to discover, query, and visualize data in all formats. Data administrators, on the other hand, are focused on configuring, managing, and governing data and infrastructure resources.
Building self-service access to a data lake means that all of these tools and interfaces must work well with the data lake infrastructure. It becomes the data team’s burden to ensure that the interoperability between these tools and the data lake infrastructure meets the needs of the people using them. For example, the latency of ODBC/JDBC connectivity with the BI tool of choice matters to data analysts, while data engineers and administrators must be able to expose metadata about their workloads to the tools they use.
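For instance, a data team might routinely measure the round-trip query latency analysts experience through an ODBC connection of the kind a BI tool uses. The sketch below assumes the pyodbc package and an invented DSN, table, and query.

```python
# Hypothetical sketch: measuring round-trip query latency over an ODBC
# connection of the kind a BI tool would use. DSN and query are invented.
import time

import pyodbc

conn = pyodbc.connect("DSN=presto_data_lake", autocommit=True)
cursor = conn.cursor()

start = time.perf_counter()
cursor.execute("SELECT count(*) FROM analytics.events")
row_count = cursor.fetchone()[0]
elapsed = time.perf_counter() - start

print(f"{row_count} rows counted in {elapsed:.2f}s over ODBC")
conn.close()
```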
Beyond ensuring that the data lake infrastructure and related tools interact smoothly, data teams must also set up and operate the tools their diverse data users require. Data scientists might be clamoring for a Jupyter- or Zeppelin-based notebook, while an analyst wants to use Looker or Tableau. The diversity of tools means that the data team now has to operate and manage both the Tableau deployment and the Jupyter notebook service.
Embracing Open Source Software
As companies expand their big data initiatives across machine learning, artificial intelligence, business intelligence, ETL (Extract, Transform, Load), and future workload types, they must likewise expand their toolkit, as some tools perform optimally for certain workloads but less so for others. For instance, Presto performs best for interactive analysis and data discovery, while Apache Spark works well for memory-intensive workloads such as building data pipelines or implementing machine learning algorithms.
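To make the contrast concrete: an analyst might point Presto at the data for an ad-hoc aggregation, while a data scientist uses Spark for an in-memory model-training pipeline along the lines of the hedged sketch below. The data set path, column names, and label are invented for illustration.

```python
# Hypothetical sketch: a memory-intensive Spark workload - training a simple
# model over the same Parquet data an analyst explores interactively with Presto.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn-model-example").getOrCreate()

# Invented schema: numeric usage features plus a binary churn label.
df = spark.read.parquet("s3a://example-data-lake/customer_features/")

assembler = VectorAssembler(
    inputCols=["sessions_per_week", "avg_session_minutes", "support_tickets"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("churned", "prediction").show(5)
```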
Presto and Spark, along with many other open-source big data tools (Hive, Hadoop, TensorFlow, Airflow, etc.), deliver unique and powerful crowd-sourced capabilities that companies can leverage for their big data initiatives. And yet these open source options require a greater commitment than one unfamiliar with technology stacks might believe.
Any company interested in or currently using open source engines must stay constantly plugged into the open source community. Each open source tool has its own group of members who actively enhance the technology, so it’s critical that you have the means to monitor changes and upcoming advancements, determine their value for your environments, and decide on the right timing and process for upgrades and patches. For every company leveraging big data, this problem is amplified by the use of multiple open source tools and frameworks. A recent Qubole report found that 76 percent of enterprises actively use at least three engines for their big data workloads, requiring you to juggle community activity, maintenance, and performance tuning across all of those workload types.
Closing Thoughts
As a result of all of these changes, enabling self-service access to data stored in data lakes has become increasingly complex. Delivering such a sophisticated infrastructure to data users within the enterprise places an enormous burden on data teams. Consequently, many installations fail: 85 percent of all big data projects, in fact. This failure holds companies back from effectively using data, the most important asset they have today. In the subsequent blog posts of this series, I will outline how data teams can get ahead of this game and provide adequate infrastructure and support to their data users and, in the process, take steps toward becoming a data-driven company.