The big data boom has given rise to a host of vendors, each promoting its own approach to meeting the growing data demands of today’s businesses. As a result, businesses seeking a big data solution have a fairly long list of vendors to choose from.
Selecting the right vendor is both a business decision and a technology decision. Vendors differ fundamentally in how they deliver on the big data promises of “actionable insights” and “competitive advantage,” and an organization’s business and IT leaders need to be aware of those differences. To that end, here is a comparative look at the main types of big data vendors.
Cloud vs. On-Premise
At the core of all big data offerings is the Hadoop analytics platform, deployed either as a cloud-based or as an on-premise solution.
On-premise vendors provide a physical Hadoop platform composed of large numbers of servers housed in a large facility on the business’s own site. This can be quite expensive once you factor in the initial hardware and facility costs, software license and support costs, the large amount of electricity required, and the expense of keeping an on-site IT team on the payroll to keep the hardware and software running smoothly. In addition, as an organization’s data demands increase, more physical servers must be added to the cluster to meet those demands, which can be a costly and time-consuming process.
The main benefit of on-premise Hadoop is that the organization has complete control over all systems and data. All corporate data is stored, handled, and secured internally behind the corporate firewall. Having a dedicated on-site IT staff for maintenance and support is also a major plus, because IT and business leaders can work closely together to keep the tech side aligned with the organization’s business objectives. And while the initial investment may be substantial, the benefits of an on-premise solution can more than make up for the costs over time.
Cloud vendors, also known as Big Data as a Service (BDaaS) vendors, offer businesses a more streamlined model than their on-premise competitors. Instead of investing heavily in expensive onsite hardware and support, organizations can contract for immediate access to the cloud vendor’s own scalable storage and analytics platform, paying only for what they use. Because cloud vendor storage and analytics services are accessed online in an essentially “plug-and-play” format, up-front installation and licensing costs are largely eliminated.
With many cloud vendors, businesses can opt to pay a monthly fee for services or simply pay as they go. As an organization’s data demands increase, the scalability of the cloud platform provides additional storage space on demand, with no fixed ceiling. Literally thousands of virtual servers can be spun up in the cloud in a matter of minutes, and the organization pays only for the storage and compute power that it actually uses. Cloud vendors are an attractive option for both large and small businesses looking for an affordable way to leverage their data for competitive advantage without having to do all of the heavy lifting themselves.
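The cost trade-off between the two models can be made concrete with some simple arithmetic. The Python sketch below is only an illustration, not a pricing model from any vendor: the class names, function names, and every figure in it are hypothetical assumptions. It compares the cumulative spend of an on-premise cluster (a one-time investment plus fixed monthly operating costs) with pay-as-you-go cloud usage, and reports the first month, if any, in which cloud spending catches up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OnPremiseCosts:
    """Hypothetical on-premise inputs (all amounts in the same currency)."""
    upfront_investment: float   # one-time hardware and facility costs
    monthly_operations: float   # licenses, support, electricity, IT staff


@dataclass
class CloudCosts:
    """Hypothetical pay-as-you-go inputs for a cloud service."""
    price_per_node_hour: float
    nodes: int
    hours_used_per_month: float


def on_premise_total(costs: OnPremiseCosts, months: int) -> float:
    """Cumulative on-premise spend after the given number of months."""
    return costs.upfront_investment + costs.monthly_operations * months


def cloud_total(costs: CloudCosts, months: int) -> float:
    """Cumulative cloud spend, paying only for the node-hours actually used."""
    monthly = costs.price_per_node_hour * costs.nodes * costs.hours_used_per_month
    return monthly * months


def break_even_month(on_prem: OnPremiseCosts, cloud: CloudCosts,
                     horizon_months: int = 120) -> Optional[int]:
    """First month in which cumulative cloud spend reaches on-premise spend.

    Returns None if the cloud stays cheaper over the whole horizon.
    """
    for month in range(1, horizon_months + 1):
        if cloud_total(cloud, month) >= on_premise_total(on_prem, month):
            return month
    return None


# Illustrative figures only; none of these numbers come from a real vendor.
on_prem = OnPremiseCosts(upfront_investment=500_000, monthly_operations=20_000)
always_on = CloudCosts(price_per_node_hour=0.50, nodes=100, hours_used_per_month=720)
part_time = CloudCosts(price_per_node_hour=0.50, nodes=100, hours_used_per_month=200)

print(break_even_month(on_prem, always_on))  # cluster running around the clock
print(break_even_month(on_prem, part_time))  # cluster used only part of the month
```

Under these assumed numbers, a cloud cluster that runs around the clock eventually costs as much as the on-premise option, while one that is used only part of each month never does within the ten-year horizon. That is the sense in which the on-premise investment can pay for itself over time, and in which paying only for what you use favors the cloud for intermittent workloads.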
Proprietary vs. Open Source
As previously mentioned, the Hadoop platform is fundamental to all big data vendor offerings. Hadoop is itself open-source software, and a few vendors have chosen to capitalize on that by packaging and supporting it largely as is. Other vendors have developed proprietary tools to enhance and build on Hadoop’s open-source components.
Open Source
Open-source vendors, such as Hortonworks, take the various components of open-source Hadoop, such as HDFS, Hadoop Common, Hadoop MapReduce, and Hadoop YARN, and package them into one fully supported big data solution. Because an open-source community of software developers supports this solution, businesses that contract with open-source vendors benefit from ongoing software improvements, refinements, and innovations. However, an open-source solution is far from “out-of-the-box,” as it requires an on-premise data warehouse, the pros and cons of which have been previously discussed.
Proprietary
Open-source Hadoop solutions are regarded as reliable at storing, managing, and analyzing massive volumes of structured and unstructured data, especially where time to insight is not a major factor. Generating the kinds of actionable real-time insights that lead to competitive advantage, however, takes the addition of differentiated, proprietary software tools.
Proprietary vendors offer solutions based primarily on the Apache Hadoop open-source distribution, but with various levels of proprietary customization. For example, Cloudera, the first vendor to develop and distribute Apache Hadoop-based software, also provides a proprietary Cloudera Management Suite to automate the installation process and reduce deployment time. Databricks is another vendor that has differentiated itself by offering proprietary stand-alone support for the Spark processing engine. Proprietary vendors, many of which are cloud-based, start where open-source Hadoop leaves off and take Hadoop to new levels of performance.
Proprietary Cloud vs. Public Cloud Data Storage
When it comes to storing big data in the cloud, organizations have two options: public or private (proprietary) cloud vendors. Public cloud offerings, such as Amazon Elastic Compute Cloud (EC2), Google Compute Engine, and Microsoft Azure, are online services that make their applications and storage resources available to the general public via the Internet. These scalable services are either free or offered on a pay-per-usage basis. Public clouds deliver services to multiple organizations, and they are well suited for organizations with predictable computing needs where direct control over the environment is not required. Many vendors offer Hadoop services via a public cloud, including Qubole, Databricks, and Amazon EMR.
Private cloud vendors deliver many of the same advantages as the public cloud, but they do so through a proprietary architecture. Private clouds tend to be more costly, as they are dedicated to a single organization, but they are better suited to organizations with mission-critical workloads where data security in a cloud environment is a major concern.