A 2018 Gartner article discussed the necessity of data lakes for implementing big data, stating “the fact remains that more than 80 percent of all data is unstructured. As more businesses turn to big data for future opportunities, the application of data lakes will rise.” This is a message that cloud companies like Amazon Web Services (AWS) embraced early on, focusing on infrastructure that could support architectures with a large variety and volume of data as well as bursty, unpredictable compute needs.
The trend has clearly caught on: average IT salaries for those with AWS experience increased by 30 percent in 2016, large enterprises like Capital One are embracing a cloud-first strategy, and even government agencies such as the FBI are moving data to AWS for advanced analytics.
Modernizing Your Data Strategy
Several primary drivers are propelling organizations to migrate data platforms from on-premises or co-located data centers to the cloud: enabling self-service access, scaling efficiently, maximizing resources, and improving cost management through the pay-as-you-go, elastic nature of the cloud. These capabilities enable organizations to become more data-driven, spend smarter, and take new initiatives to market in days rather than months, all of which has caught the attention of executives.
Cloud providers such as Amazon, Google, and Microsoft are poised to help the industry shift away from legacy operations and hardware that can be both expensive and limiting to data initiatives. However, the market has also been overrun with noise from similar-sounding services and naming conventions.
What organizations need is a clearly defined plan and established expectations across teams, so that users and use cases can be transitioned pragmatically. With this approach, companies can avoid unexpected costs while showing clear evidence of the benefits and opportunities gained for the money spent.
While there is definitely room for multiple infrastructures (on-premises, hybrid, and cloud-only), building a data lake in the cloud is a fundamentally different challenge. Separating storage from compute tends to be one of the largest leaps for organizations moving to a cloud data platform. When it comes to managing costs, billing on cloud services is almost entirely on-demand and usage-based. Companies also need to think about their users and the level of data and tool access each of them needs.
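To make the storage/compute split concrete, here is a minimal sketch of an ephemeral analytics job against object storage: a Spark session exists only for the duration of the work, reads Parquet files that live permanently in S3, writes results back to the lake, and is then torn down. The bucket, paths, and column names are hypothetical, and the sketch assumes a Spark environment (such as EMR) where the S3 connector is already configured.

```python
# Minimal sketch: compute is ephemeral, storage is durable object storage.
# Assumes a Spark environment with the S3 connector configured (e.g., EMR);
# the bucket, prefixes, and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spin up compute only for the lifetime of this job.
spark = SparkSession.builder.appName("adhoc-clickstream-rollup").getOrCreate()

# The data lives in S3 independently of any cluster; reading it does not
# require the cluster that originally wrote it.
events = spark.read.parquet("s3://example-data-lake/raw/clickstream/")

# A typical exploratory aggregation: daily event counts per event type.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Write results back to the lake; once the cluster is shut down, the output
# remains available to other tools and teams.
daily_counts.write.mode("overwrite").parquet(
    "s3://example-data-lake/curated/clickstream_daily/"
)

# Tear down compute; storage (and the results) persist, and billing for
# compute stops here.
spark.stop()
```

Because the cluster is paid for only while it runs, this pattern is also where usage-based billing and per-user access controls show up in practice: the same S3 data can be exposed to different teams with different tools and permissions, without any of them owning the compute.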
Operationalizing the Data Lake
To help data and platform leaders, whether beginners or experts, we have collaborated with O’Reilly Media to bring you the book “Operationalizing the Data Lake.” This comprehensive work is authored by big data evangelists Jon King and Holden Ackerman and includes over a dozen contributors from every department and role on a data team, from users such as data scientists and analysts to platform and security engineers.
As Jon King writes:
Storage and compute limitations meant we were having to constantly decide what data we could and could not keep in the warehouse and schema evolutions meant we were frequently taking long maintenance outages…[thus we built a cloud data lake], realizing that we needed a platform that could help us manage the new open source tools and technologies that could handle these vast data volumes with the elasticity of the cloud, while controlling cost overruns.
In this book, we take you through the basics of building a data lake operation, from the people involved to the multiple technologies and frameworks that make up a cloud-native data platform. You’ll dive into the tools and processes you need for the entire lifecycle of a data lake, from data preparation, storage, and management to distributed computing and analytics. You’ll also explore the unique role each member of your data team needs to play as you migrate to your cloud-native data platform.