Streaming Data Processing - reducing time-to-insight

May 5, 2020 by Prateek Shrivastava and Jorge Villamariona Updated March 26th, 2024

What’s keeping streaming data processing investments from yielding “speedy” results?

There are many streaming data processing solutions available, but none is well equipped to address the challenges that keep us from obtaining speedy insights from streaming data. Businesses keep investing in multiple software solutions and stitching together complex data integrations to achieve quicker time-to-value, for example by identifying the right target audience in real time for serving ads or offers. This challenge plays out across three different types of tools:

Dedicated stream data integration tools provide efficient ways to build streaming data pipelines with a wide array of connectivity options, but they do not integrate easily with the underlying data platform. They also make it difficult to maintain the integrity of a streaming pipeline because they do not surface key events from the underlying streaming platform or engine, which leads to debugging and maintenance issues. These are good productivity tools, but they provide no mechanism for cost optimization, performance improvement, or pipeline management (such as upgrading the application and platform while in use).
Open source streaming facilities such as Apache Kafka, Apache Spark, and Apache Flink focus on performance and scale (low latency, high throughput, and high reliability). However, data teams cannot make full use of these engines because experimentation, composition, and debugging are still challenging and tedious tasks.
Tools built by public cloud providers offer their own proprietary flavors of stream processing platforms, but they have serious disadvantages, such as lock-in to closed ecosystems.

These are the reasons why many organizations are still struggling to lower time-to-insight (illustration below).

In its annual survey, Gartner reports that 47% of organizations need streaming data to build a digital business platform, yet only 12% of those organizations reported that they currently integrate streaming data for their data and analytics requirements. This leads to missed opportunities as digital businesses require enterprises to connect, manipulate and harvest data at the scale and speed of changes in business (see “Adopt Stream Data Integration to Meet Your Real-Time Data Integration and Analytics Requirements”).

After assembling a stream processing stack with a combination of open source technologies and vendors, there are still engineering challenges hindering the success of the project.

Ease of Experimentation
To experiment, in most cases you should not need to set up Kafka or Kinesis, as this would delay POVs. Data often arrives steadily as files on public cloud storage, and the streaming job can keep processing it incrementally as it arrives.
Few open source connectors are available, and some of those are not reliable. It is difficult to provide exactly-once guarantees with these endpoints and to handle back-pressure.
The bar for creating long-running pipelines is extremely high for business users, who are looking for a fast and easy self-serve approach that does not require learning new programming languages or the complexities of a streaming platform.
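The file-based experimentation described above can be illustrated without any message broker. The following is a minimal sketch in plain Python, assuming files land in a directory and are never modified after they appear; `process_new_files` and `demo` are hypothetical names used only for illustration, and a local temporary directory stands in for cloud storage.

```python
import os
import tempfile


def process_new_files(directory, seen):
    """Read every file in `directory` that is not yet in `seen`.

    `seen` is mutated so that each file is processed only once,
    which is the simplest form of incremental processing.
    """
    rows = []
    for name in sorted(os.listdir(directory)):
        if name in seen:
            continue  # already handled in an earlier pass
        with open(os.path.join(directory, name)) as f:
            rows.extend(line.strip() for line in f if line.strip())
        seen.add(name)
    return rows


def demo():
    """Simulate three passes over a directory that receives two files."""
    with tempfile.TemporaryDirectory() as d:
        seen = set()
        with open(os.path.join(d, "part-0001.txt"), "w") as f:
            f.write("a\nb\n")
        first = process_new_files(d, seen)
        with open(os.path.join(d, "part-0002.txt"), "w") as f:
            f.write("c\n")
        second = process_new_files(d, seen)
        third = process_new_files(d, seen)  # nothing new arrived
    return first, second, third
```

In a real pipeline the set of processed files would have to be stored durably, otherwise a restart would reprocess everything.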
Data Accuracy and Consistency
In long-running pipelines, data consistency, accuracy, and quality issues arise from out-of-order data arrival and schema evolution. The resulting mismatches lead to broken pipelines, loss of data, or both.
It is difficult to keep sources and sinks consistent over a long period of time, and it is hard to notice and account for event drops. Engineers often learn of a drop long after it has occurred. As a result, they have to rely on expensive Lambda architectures and build redundant ingestion pipelines, which increases costs by at least 2x.
Fixing these issues would be a significant improvement over the generic Lambda architecture. It would reduce the demand for re-processing, lower compute requirements, and lower the overhead related to merges.
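One way to notice event drops sooner is to reconcile event counts between source and sink per time window. The sketch below is an illustration under stated assumptions, not a description of any product feature: events are dictionaries with an integer `ts` field (epoch seconds), and the function names and the 60-second window are hypothetical.

```python
from collections import Counter


def window_counts(events, window_seconds=60):
    """Count events per tumbling window, keyed by window start time."""
    counts = Counter()
    for event in events:
        window_start = (event["ts"] // window_seconds) * window_seconds
        counts[window_start] += 1
    return counts


def find_drops(source_events, sink_events, window_seconds=60):
    """Return {window_start: missing_count} for windows that do not match."""
    src = window_counts(source_events, window_seconds)
    snk = window_counts(sink_events, window_seconds)
    return {w: src[w] - snk[w] for w in sorted(src) if src[w] != snk[w]}


source = [{"ts": t} for t in (0, 10, 70, 80, 130)]
sink = [{"ts": t} for t in (0, 10, 70, 130)]  # the event at ts=80 was lost
drops = find_drops(source, sink)
```

Here `drops` reports one missing event in the window starting at 60 seconds, so only that window needs to be re-ingested.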
Replay/Reprocess Data
Sometimes models get updated or business logic changes, which requires users to edit the pipelines, make changes, and replay from a particular checkpoint. Today, data engineers need to build offline batch tasks to get the desired output, and even that may become too complex if live outcalls are involved to enrich the data.
Tinkering with the checkpoint is also risky and can compromise pipeline stability, especially when the engineer is unsure of the approach.
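The replay idea can be sketched with a toy model: an append-only log, a checkpoint that stores an offset, and a replay that works on a copy of the checkpoint so the live one is never touched. All names here (`run`, `replay`, the `offset` key) are hypothetical and do not correspond to the checkpoint format of any real engine.

```python
import copy


def run(log, checkpoint, transform):
    """Process log entries from the checkpoint offset onward.

    Returns the outputs and a new checkpoint pointing at the end of the log.
    """
    start = checkpoint["offset"]
    outputs = [transform(record) for record in log[start:]]
    return outputs, {"offset": len(log)}


def replay(log, live_checkpoint, from_offset, new_transform):
    """Re-run new logic from an earlier offset without mutating the live checkpoint."""
    trial = copy.deepcopy(live_checkpoint)
    trial["offset"] = from_offset
    outputs, _ = run(log, trial, new_transform)
    return outputs


log = [1, 2, 3, 4]
live = {"offset": 4}  # the live pipeline has consumed everything
replayed = replay(log, live, from_offset=2, new_transform=lambda x: x * 10)
```

After the replay, `live` still points at offset 4, so the running pipeline is unaffected by the experiment.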
Higher Performance
Managing state in stream-stream joins, aggregations, and similar operations significantly impacts performance and reliability. Open source Spark's default implementation keeps state in the JVM memory of the executors, where pressure builds quickly, causing garbage collection pauses and eventually blocking the pipeline. In these scenarios it is difficult to anticipate a failure (you won't know when it is coming) and take proactive measures.
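A common way to keep aggregation state from growing without bound is to evict windows once a watermark has passed them. The sketch below shows the general technique in plain Python; it is not Spark's implementation, and the class name, the 60-second window, and the 30-second allowed lateness are assumptions chosen for illustration.

```python
from collections import defaultdict


class WindowedCounter:
    """Per-key tumbling-window counts with watermark-based state eviction."""

    def __init__(self, window_seconds=60, allowed_lateness=30):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.state = defaultdict(int)  # (window_start, key) -> count
        self.finalized = {}            # closed windows, removed from state

    def watermark(self):
        """Event time before which no further data is expected."""
        return self.max_event_time - self.allowed_lateness

    def add(self, key, ts):
        """Add one event; return False if its window is already closed."""
        window_start = ts - ts % self.window_seconds
        if window_start + self.window_seconds <= self.watermark():
            return False  # too late, the window was already finalized
        self.state[(window_start, key)] += 1
        self.max_event_time = max(self.max_event_time, ts)
        self._evict()
        return True

    def _evict(self):
        """Move every window that ends at or before the watermark out of state."""
        wm = self.watermark()
        closed = [k for k in self.state if k[0] + self.window_seconds <= wm]
        for k in closed:
            self.finalized[k] = self.state.pop(k)


counter = WindowedCounter()
accepted = [counter.add("a", ts) for ts in (10, 50, 70, 55, 100, 20)]
```

The out-of-order event at `ts=55` is still counted because its window is open, while the event at `ts=20` arrives after the watermark has closed that window and is rejected. Only the open window remains in `state`.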
Lower TCO
Spark Structured Streaming pipelines are long-running and hence costs can quickly spiral out of control. Users do not know what an ideal cluster size is or what number of executors to use to guarantee business-level SLAs. Hence, they oscillate between over-provisioning and under-provisioning streaming clusters.
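A back-of-the-envelope estimate can at least give a starting point for cluster size. The sketch below assumes you have measured the sustained throughput of a single executor for your workload; the function name, the 20% headroom default, and the example rates are hypothetical and not taken from the article.

```python
import math


def executors_needed(events_per_sec, per_executor_events_per_sec, headroom=1.2):
    """Estimate executor count: peak input rate with headroom, divided by
    measured per-executor throughput, rounded up."""
    if per_executor_events_per_sec <= 0:
        raise ValueError("per-executor throughput must be positive")
    required = events_per_sec * headroom / per_executor_events_per_sec
    return max(1, math.ceil(required))


# Example: 50,000 events/s peak, 8,000 events/s per executor, 20% headroom.
estimate = executors_needed(50_000, 8_000)
```

With these example numbers the estimate is 8 executors (60,000 / 8,000 = 7.5, rounded up). Such a static estimate ignores skew and bursty traffic, so it should be validated against observed processing latency.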
Attributes of Data Processing
The challenge is to make downstream analytics faster, to reduce overall time-to-decision. Stream processing solutions must process and write enriched data into correct partitions, data formats, and optimal file sizes. Too many small files hamper performance on downstream SQL analytics or machine learning.
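The small-files problem is often addressed by compacting many small outputs into fewer files near a target size. The following is a minimal planning sketch, assuming file sizes in MB are known up front; `plan_compaction` and the 128 MB target are hypothetical choices for illustration, and the greedy grouping is one simple strategy among many.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group files, largest first, so each group stays at or below target_mb.

    A file larger than the target ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)  # close the current group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


plan = plan_compaction([10, 20, 100, 5, 60, 64])
```

In this example six input files are planned into three output files, which reduces the number of objects a downstream SQL engine has to open.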
Portability
Businesses need a way to minimize risk. They also need a way to serve multiple customers with varying preferences for cloud vendors. As a result, an increasing number of companies are adopting a multi-cloud strategy. They are asking for the same set of tools, based on open source technology stacks, so that they can port existing pipelines from one cloud to another without a major rewrite.
If any of these problems resonate with you or if you would like further information on how Qubole can help address these challenges contact us and we will connect you with our product team. For a free 14-day test drive of Qubole, click here.
