From its beginnings in Nutch, a modest open-source search engine, Hadoop has evolved into a powerful big data analytics platform. As big data technologies and policies rapidly advance, Hadoop is just getting started.
In a recent article on Computerworld.com, writer Robert L. Mitchell interviews IT leaders, consultants, and industry analysts to reveal “8 big trends in big data analytics.” Together these trends paint an exciting picture of where Hadoop is headed.
1. Hadoop takes to the cloud
Originally designed to run on clusters of commodity hardware, Hadoop is quickly migrating to the cloud. As Forrester Research analyst Brian Hopkins points out, “Now an increasing number of technologies are available for processing data in the cloud.” Among the cloud services mentioned in the article are Amazon’s Redshift, Google’s BigQuery data analytics service, and IBM’s Bluemix cloud platform. As Hadoop shifts to the cloud, more and more companies are moving from in-house database infrastructure to cloud-based warehouses, and that trend is set to continue. According to Hopkins, “The future state of big data will be a hybrid of on-premises and cloud.”
2. Hadoop comes to the enterprise
MapReduce and other distributed analytics frameworks are evolving into distributed resource managers “that are gradually turning Hadoop into a general-purpose data operating system,” says Hopkins. What that means for the enterprise is that as workloads such as SQL, MapReduce, in-memory, stream processing, and graph analytics are able to run on Hadoop with adequate performance, more businesses will implement Hadoop as an enterprise data hub. According to Hopkins, “The ability to run many different kinds of [queries and data operations] against data in Hadoop will make it a low-cost, general-purpose place to put data that you want to be able to analyze.”
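To make “workload” concrete, here is a minimal word-count MapReduce job sketched for Hadoop Streaming in Python. In practice the two functions would live in separate mapper.py and reducer.py scripts, and the streaming jar path shown in the comment is an assumption that varies by installation.

```python
import sys

# mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" per input word.
def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so each word's counts can be summed with a single running total.
def reducer():
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

# Submitted to the cluster roughly like this (jar path varies by install):
# hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#     -mapper mapper.py -reducer reducer.py
```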
3. The rise of data lakes
In working with traditional databases, analysts had to first design the data set before they could enter any data. But according to Chris Curran, principal and chief technologist in PricewaterhouseCoopers’ U.S. advisory practice, “A data lake, also called an enterprise data lake or enterprise data hub, turns that model on its head.” Curran goes on to explain that a data lake allows all types of data to be dumped into a “big Hadoop repository” without designing a data model beforehand. As for analyzing and defining what data exists in the lake, Curran says that “People build the views into the data as they go along. It’s a very incremental, organic model for building a large-scale database.”
One drawback of the data lake model is that it requires highly skilled people to use it. In answer to that challenge, Bill Loconzolo, vice president of data engineering at Intuit, says his company’s focus is on “democratizing the tools surrounding it to enable business people to use it effectively.”
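A minimal sketch of the schema-on-read idea Curran describes, assuming raw JSON events have already been dumped into a hypothetical HDFS path; PySpark infers the structure at query time rather than up front.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-views").getOrCreate()

# Raw events were dumped into the lake as-is -- no schema was designed first.
# (The HDFS path is hypothetical, for illustration only.)
events = spark.read.json("hdfs:///data-lake/raw/clickstream/")

# "Build the views into the data as you go": inspect what landed, then shape it.
events.printSchema()
daily_visits = (events
                .where(events.event_type == "page_view")
                .groupBy("user_id")
                .count())
daily_visits.show()
```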
4. More predictive analytics
Big data means more data. And Hadoop analytics platforms provide the processing power analysts need to “do very large numbers of records and very large numbers of attributes per record,” says Hopkins in the article, “and that increases predictability.”
For Loconzolo and Intuit, the interest lies in enabling “real-time analysis and predictive modeling out of the same Hadoop core.” But Hadoop is slower at returning answers than more established technologies, which Loconzolo says has been a problem. One solution to the speed problem that Intuit is currently testing is the large-scale data processing engine Apache Spark, along with its query tool, Spark SQL. So far the tests look promising. “Spark has this fast interactive query as well as graph services and streaming capabilities,” says Loconzolo. “It is keeping the data within Hadoop, but giving enough performance to close the gap for us.”
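A sketch of the kind of predictive modeling Loconzolo describes, using Spark’s MLlib against data kept in Hadoop; the HDFS path, feature columns, and the “churned” label are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("predictive-on-hadoop").getOrCreate()

# The data stays within Hadoop; Spark reads it straight from HDFS.
df = spark.read.parquet("hdfs:///data-lake/curated/customers/")  # hypothetical

# Assemble hypothetical feature columns into the single vector MLlib expects.
assembler = VectorAssembler(
    inputCols=["visits", "tenure_days", "support_tickets"],
    outputCol="features")
train = assembler.transform(df)

# Fit a simple churn model; "churned" is an assumed 0/1 label column.
model = LogisticRegression(labelCol="churned").fit(train)
model.transform(train).select("churned", "prediction").show(5)
```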
5. Faster, better SQL on Hadoop
According to Mark Beyer, a Gartner analyst interviewed for the article, while Hadoop lets smart coders and mathematicians “drop data in and do an analysis on anything,” it remains too sophisticated for business users to do the same. “I need someone to put it into a format and language structure that I’m familiar with,” says Beyer. Fortunately, SQL-on-Hadoop products allow business users already familiar with SQL to apply those same techniques to Hadoop data. According to Hopkins, SQL on Hadoop “opens the door to Hadoop in the enterprise” by eliminating the need for “high-end data scientists and business analysts who can write scripts using JavaScript and Python.”
As Robert L. Mitchell explains in the article, these tools are not new. Apache Hive has offered a structured, SQL-like query language for Hadoop for quite a while. But commercial alternatives from vendors such as Qubole offer much higher performance and are getting faster all the time. Still, Hopkins cautions that Hadoop won’t be replacing data warehouses anytime soon, “but it does offer alternatives to more costly software and appliances for certain types of analytics.”
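A sketch of what SQL on Hadoop looks like in practice: everyday SQL submitted through Spark SQL against a table registered over files in Hadoop. The “sales” table and its columns are hypothetical, and enabling Hive support assumes a Hive metastore is available.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL query tables defined in the Hive metastore.
spark = (SparkSession.builder
         .appName("sql-on-hadoop")
         .enableHiveSupport()
         .getOrCreate())

# A business user's familiar SQL, running against data stored in Hadoop.
top_regions = spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    WHERE sale_date >= '2015-01-01'
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 10
""")
top_regions.show()
```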
6. More, better NoSQL
According to Chris Curran, NoSQL databases are gaining momentum as alternatives to traditional relational databases for specific kinds of analytic applications. Curran estimates that 15 to 20 open-source NoSQL databases currently exist, each with its own specialization.
One example in the article of a NoSQL database in action involves sensors on store shelves that monitor products, how long customers handle them, and how long customers linger in front of particular shelves. “These sensors are spewing off streams of data that will grow exponentially,” Curran says. “A NoSQL key-value pair database is the place to go for this because it’s special-purpose, high-performance, and lightweight.”
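A sketch of the lightweight key-value write path Curran has in mind, using Redis as a stand-in for whichever key-value store a team might choose; the key schema, event fields, and one-week expiry are invented for illustration.

```python
import json
import time

import redis  # Redis as one example of the key-value stores Curran describes

r = redis.Redis(host="localhost", port=6379)

def record_shelf_event(store_id, shelf_id, event):
    """Write one sensor reading under a composite key (hypothetical schema)."""
    key = f"shelf:{store_id}:{shelf_id}:{int(time.time() * 1000)}"
    # Lightweight write path: one key, one opaque value, no schema to migrate.
    r.set(key, json.dumps(event), ex=7 * 24 * 3600)  # expire after one week

record_shelf_event("store-42", "aisle3-shelf7",
                   {"product": "cereal", "dwell_seconds": 12.5})
```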
7. Deep learning
Deep learning “is still evolving but shows great potential for solving business problems,” Hopkins claims. Using deep learning, Hopkins explains, computers can recognize “items of interest” in massive volumes of unstructured and binary data, deducing relationships without relying on specific models or programming instructions. As an example, Hopkins cites a deep learning algorithm that examined Wikipedia data and learned on its own that California and Texas are both states in the U.S. According to Hopkins, what makes that learning method different from older machine learning is that “it did not have to be modeled to understand the concept of a state and country.”
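Hopkins’s Wikipedia example is easiest to picture with word embeddings. The sketch below uses gensim’s Word2Vec, a shallow relative of the deep networks he describes, to learn word relationships from raw text with no hand-built model of “state” or “country”; the toy corpus is invented for illustration.

```python
from gensim.models import Word2Vec

# A toy corpus standing in for Wikipedia text (illustration only).
sentences = [
    ["california", "is", "a", "state", "in", "the", "united", "states"],
    ["texas", "is", "a", "state", "in", "the", "united", "states"],
    ["paris", "is", "the", "capital", "of", "france"],
] * 100  # repeat so the tiny corpus has enough co-occurrences to train on

# Nothing tells the model what a "state" is; it infers that california
# and texas are related purely from how the words co-occur in text.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)
print(model.wv.most_similar("california", topn=3))
```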
In the future, Hopkins feels that big data will make use of deep learning techniques “in ways that we are now only beginning to understand.” According to Hopkins, “This notion of cognitive engagement, advanced analytics and the things it implies . . . is an important future trend.”
8. In-memory analytics
The last trend discussed in Mitchell’s article is the increasing use of in-memory databases to speed up analytic processing. Beyer points out that many businesses are already using hybrid transaction/analytical processing (HTAP), in which both transactions and analytic processing reside in the same in-memory database. However, he cautions that HTAP has been overhyped and is consequently being overused, especially where users simply need to see the same data multiple times in the same day and that data doesn’t significantly change. Bringing in an in-memory database, Beyer argues, also means there’s another product to manage, secure, and figure out how to integrate and scale. Loconzolo at Intuit essentially agrees with Beyer. “If we can solve 70 percent of our use cases with Spark infrastructure and an in-memory system could solve 100 percent,” he says, “we’ll go with the 70 percent in our analytic cloud.”
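Loconzolo’s 70-percent point is roughly what Spark’s own in-memory caching buys: load a hot dataset from Hadoop once, pin it in cluster memory, and serve the day’s repeated queries from RAM without a separate in-memory product. A minimal sketch, with the path and columns invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-analytics").getOrCreate()

# Load once from Hadoop, then pin the hot dataset in cluster memory.
orders = spark.read.parquet("hdfs:///warehouse/orders/")  # hypothetical path
orders.cache()
orders.count()  # first action materializes the cache

# Repeated queries over the day now hit memory instead of disk.
orders.groupBy("region").sum("revenue").show()
orders.where(orders.status == "returned").count()
```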
Which of these Hadoop industry trends will take hold in the future? Only time, and perhaps predictive analytics, will tell.
Big data in the cloud puts business intelligence first and technology second. With the power of a Hadoop cluster delivered as a fully managed service, Hadoop as a Service makes using big data for marketing easy. Learn more about Hadoop in the cloud.