Which Big Data Framework is the Best Fit?
Apache Hadoop wasn’t just the “elephant in the room”, as some called it in the early days of big data. Hadoop was the room. But that is changing as Hadoop moves over to make way for Apache Spark, a newer and more advanced big data tool from the Apache Software Foundation.
There’s no question that Spark has ignited a firestorm of activity within the open-source community. So much so that organizations looking to adopt a big data strategy now question which solution is the better fit: Hadoop, Spark, or both? To help answer that question, here’s a comparative look at the two big data frameworks.
What is Hadoop?
Although Hadoop got its name from a toddler’s toy elephant, Hadoop should be thought of as a workhorse. Fundamentally, Hadoop is an open-source parallel data processing platform that uses a distributed file system (HDFS) and the MapReduce execution engine to store, manage, and process very large data sets across distributed clusters of commodity servers. Because the Hadoop MapReduce framework and HDFS run on the same set of nodes, Hadoop can schedule compute tasks on the nodes where the data is already stored. This data locality yields high aggregate bandwidth across the cluster, enabling Hadoop to do the heavy lifting of processing vast data sets in a reliable, fault-tolerant manner.
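To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets plain executables act as the map and reduce phases. The file names and the example invocation in the comments are illustrative, not tied to any particular distribution.

```python
# mapper.py -- emits a (word, 1) pair for every word read from stdin.
# Example run (paths illustrative): hadoop jar hadoop-streaming.jar \
#   -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
#   -input /data/in -output /data/out
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop sorts and groups mapper output by key between the
# phases, so identical words arrive on consecutive lines and can be summed.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")  # flush the final key
```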
For organizations considering a big data strategy, Hadoop has many features that make it worth considering, including:
- Flexibility – Hadoop can handle the multiple data formats that make up today’s big data.
- Scalability – Hadoop can scale up or down to accommodate small to very large workloads.
- Affordability – Hadoop levels the playing field by allowing even small organizations with modest budgets to reap big data benefits.
Hadoop’s Biggest Drawback
With so many important features and benefits, Hadoop is a valuable and reliable workhorse. But like all workhorses, Hadoop has one major drawback: compared with Spark, it just isn’t fast. Most MapReduce jobs are long-running batch jobs that can take minutes, hours, or longer to complete. On top of that, big data demands and aspirations are growing, and batch workloads are giving way to more interactive pursuits that the Hadoop MapReduce framework just isn’t cut out for.
For many organizations, a better big data strategy might be to keep the workhorse in the barn and bring out the racehorse, which is where Apache Spark comes in.
What is Apache Spark?
Developed at UC Berkeley’s AMPLab in 2009 and donated to the Apache Software Foundation in 2013, Spark is a scalable, open-source execution engine for Hadoop designed for fast and flexible analysis of large, multi-format data sets, with an emphasis on the word “fast”. Compared to Hadoop MapReduce, Spark can run programs up to 100 times faster in memory and up to 10 times faster for complex applications running on disk.
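Much of that speedup comes from keeping working data in memory between steps. The toy sketch below (local mode, synthetic data) shows the pattern: an RDD is cached after its first use and then reused across several passes, where a chain of MapReduce jobs would write intermediate results to disk and re-read them on every pass.

```python
# A toy illustration of Spark's in-memory model (local mode, synthetic data).
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

# cache() keeps the partitions in memory after they are first computed
numbers = sc.parallelize(range(1_000_000)).cache()

total = 0
for divisor in (10, 100, 1000):
    # Each pass reuses the cached RDD instead of rebuilding it from scratch
    total += numbers.filter(lambda n: n % divisor == 0).count()

print(total)
sc.stop()
```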
Spark doesn’t just process batches of stored data after the fact, as MapReduce does. Thanks to a feature called Spark Streaming, Spark can also manipulate data in real time, enabling fast interactive queries that finish within seconds.
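As a rough illustration, here is a minimal Spark Streaming word count that processes text arriving on a local socket in one-second micro-batches; the host and port are assumptions made for the sketch.

```python
# A minimal Spark Streaming sketch: count words arriving over a socket.
# (Assumes something is writing text to localhost:9999, e.g. `nc -lk 9999`.)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-word-count")
ssc = StreamingContext(sc, batchDuration=1)  # one-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts as they arrive

ssc.start()
ssc.awaitTermination()
```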
Spark has another advantage over MapReduce: it broadens the range of computing workloads that Hadoop can handle. Spark on Hadoop supports operations such as SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. Spark also allows these capabilities to be combined seamlessly into a single workflow.
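A short sketch of that combination, with a hypothetical input file and column names and assuming a recent Spark release with the DataFrame API: the result of a Spark SQL query feeds directly into an MLlib machine-learning stage inside the same job.

```python
# Combining Spark SQL and MLlib in one workflow (hypothetical data and columns).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("sql-plus-ml").getOrCreate()

# SQL over a structured data set (events.json is a placeholder file)
spark.read.json("events.json").createOrReplaceTempView("events")
recent = spark.sql("SELECT user_id, amount FROM events WHERE amount > 0")

# Feed the query result straight into a machine-learning stage
features = VectorAssembler(inputCols=["amount"], outputCol="features").transform(recent)
model = KMeans(k=3, featuresCol="features").fit(features)
model.transform(features).show()

spark.stop()
```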
It should be noted that Spark does not include its own distributed file system. But that’s not a problem. Because Spark is fully compatible with the Hadoop Distributed File System (HDFS), HBase, and any other Hadoop storage system, virtually all of an organization’s existing data is immediately usable in Spark.
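For example, pointing Spark at data that already lives in HDFS is a one-liner; the namenode address and path below are hypothetical.

```python
# Reading existing HDFS data from Spark (hypothetical namenode and path).
from pyspark import SparkContext

sc = SparkContext(appName="read-hdfs")
logs = sc.textFile("hdfs://namenode:8020/logs/*.log")  # any HDFS-visible path works
print(logs.count())  # tasks are scheduled close to the underlying HDFS blocks
sc.stop()
```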
Hadoop’s Evolution
Once a basic open-source platform, Hadoop is evolving into a universal framework that supports multiple processing models. As a result, organizations can now use multiple big data tools instead of relying on just one. Alongside Spark in the big data stable are tools such as NoSQL databases, Hive, Pig, and Presto, each designed with a specific use case in mind. Armed with the right tool(s) for the right job, organizations both large and small can leverage the power of Hadoop to mount successful big data strategies.