Big data has become an essential requirement for enterprises looking to harness their business potential. The use cases for big data are endless and range from customer targeting and fraud analytics to anomaly detection and more. This data can be generated quickly from various sources such as users’ browser and search history, credit card payments, mobile pinging of the nearest cell phone tower, etc. Given the volume of sensitive information being captured, any unauthorized or accidental disclosure of or access to the data can have severe consequences for your enterprise, both in financial terms and in more intangible ways, such as the loss of brand recognition and users’ trust.
In recent years, many highly scalable and complex processing frameworks for big data have emerged, such as Hadoop, Hive, Presto, and Spark. Securing these frameworks is very challenging because of their distributed nature, and involves many touchpoints, services, and operational processes. With accelerating cloud adoption, monitoring access and data flows for big data becomes even more complicated.
Why Access Controls Matter
Many enterprises attempt to secure data through encryption and perimeter control, but without a comprehensive, granular data-access control strategy. Such a strategy is crucial because multiple employees with different levels of authority, responsibility, and competency will be running different jobs on the platform. For example, the marketing team needs access to sales data to analyze customer churn and sentiment, while the finance team needs access to financial data in order to produce financial projections.
When hundreds or thousands of employees need access to data for many different uses, coarse-grained permissions that give users “all or nothing” access are no longer sufficient. Instead, you need a set of scalable, consistent, and fine-grained control capabilities that prevent unnecessary access to sensitive information at every stage of processing. Furthermore, growing public concern and new global compliance requirements make implementing such a solution all the more urgent.
This blog post is part of a three-part series that explores the challenges and nuances of administering granular data access for cloud-based big data. This post outlines the general best practices for granular access controls and the security features that Qubole provides. In the coming posts, we will demonstrate how to adequately set roles and permissions and discuss the additional granular access control resources your enterprise can use on Qubole’s platform.
How to Leverage Granular Access Controls
Implementing an adequate access control strategy in addition to authentication can be very effective in reducing the attack surface for the most common causes of data breaches, such as phishing and social engineering attacks. According to the Cloud Security Alliance, a comprehensive granular access control strategy includes the following elements:
- Normalizing mutable elements and denormalizing immutable elements
- Tracking secrecy requirements and ensuring proper implementation
- Maintaining access labels
- Tracking administrator data
- Using Single Sign-On (SSO)
- Using a labeling scheme to maintain proper data federation
Qubole currently supports many of the features listed above that are applicable to our platform. However, options and configurations for access controls will vary according to the cloud service model and provider-specific capabilities. At Qubole, we strive to deliver a data-access control solution that is engine- and cloud-agnostic. We enable access controls at a minimum of three levels, from data ingest to data access: the infrastructure, platform, and data levels.
At the Infrastructure Level
Access controls can be put in place to limit access to cloud infrastructure resources and services. For instance, in AWS, enterprise system administrators can use IAM roles to restrict Qubole’s access to AWS resources such as S3 storage and EC2 instances. For added granularity, enterprises can create dual IAM roles: one acts as the cross-account IAM role with access across AWS accounts, and the other has restricted access to the data in AWS S3 buckets.
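To make the restricted data role more concrete, here is a minimal sketch in Python of the kind of IAM policy document that could limit a role to a single S3 bucket. The bucket name and the helper function are hypothetical, and the chosen actions are only an illustration; this is not the policy that Qubole requires, so consult the official documentation for the exact permissions.

```python
import json

# Hypothetical bucket name, used only for illustration.
DATA_BUCKET = "example-analytics-data"


def build_s3_data_policy(bucket: str) -> dict:
    """Return an IAM policy document scoped to one S3 bucket (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading and writing are granted on objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = build_s3_data_policy(DATA_BUCKET)
print(json.dumps(policy, indent=2))
```

Keeping the data policy separate from the cross-account role means that access to data can be tightened or widened per bucket without touching the permissions that manage compute resources.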
In Azure, system administrators can restrict Qubole’s access to Azure resources using Azure Active Directory and IAM roles under Azure RBAC. Administrators can configure an application for Qubole under Active Directory and either assign it the Contributor IAM role or create a custom IAM role to further limit access to compute resources. Qubole’s access to storage resources can be limited using storage keys for Azure blob storage and Active Directory-integrated OAuth tokens for Azure Data Lake Store (ADLS).
At the Platform Level
On Qubole’s platform, system administrators can leverage Qubole’s built-in Role-Based Access Control (RBAC) capabilities to restrict users’ access to specific platform artifacts such as clusters, notebooks, and dashboards. Customers can either use predefined roles or create custom roles. Custom roles provide granular permissions based on business functions, granting users varying degrees of access and permissible actions on Qubole’s platform. Applying RBAC to Qubole’s artifacts can help achieve privacy, improve operational efficiency, and reduce the administrative burden of policy management.
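The idea behind custom roles can be sketched in a few lines of Python. This is a simplified model of role-based access control in general, not Qubole’s API: the `Role` class, the role names, the action names, and the `is_allowed` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """A named role mapping resource types to the actions it permits."""
    name: str
    permissions: dict = field(default_factory=dict)


def is_allowed(roles, resource, action):
    """Return True if any of the user's roles permits the action."""
    return any(action in role.permissions.get(resource, set()) for role in roles)


# Hypothetical custom roles based on business functions.
analyst = Role("analyst", {"notebook": {"read", "update"}, "dashboard": {"read"}})
cluster_admin = Role("cluster_admin", {"cluster": {"read", "start", "terminate"}})

# Hypothetical users and their assigned roles.
user_roles = {"alice": [analyst], "bob": [analyst, cluster_admin]}

print(is_allowed(user_roles["alice"], "cluster", "start"))  # False
print(is_allowed(user_roles["bob"], "cluster", "start"))    # True
```

Because permissions attach to roles rather than to individual users, changing what a business function may do requires a single update to the role instead of one update per user.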
With RBAC on Qubole, enterprises also enjoy the added advantage of better financial governance through cost reduction, because administrators can allocate more compute resources exclusively to high-value users without disrupting other users.
For more information on how to manage roles on Qubole, see our documentation on managing access permissions and roles.
At the Data Level
Given increased public concern and stiff legal penalties, data-level access controls are essential, especially for enterprises that process personally identifiable information, protected health information, or other sensitive information.
Qubole provides granular access controls through Hive authorization. Qubole’s Hive authorization affords users the ability to implement granular access to tables across use cases and engines such as Spark, Presto, and Hive. Going forward, we will extend access control capabilities to row filtering, column-level access control, and data masking. Our goal is to offer our customers a sufficient level of granular controls to achieve the best data access governance.
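As an illustration of table-level authorization, the Python sketch below generates GRANT statements in the general form used by Hive’s SQL standard-based authorization. The role and table names are hypothetical, borrowed from the marketing and finance example earlier in this post, and the exact syntax supported may depend on your Hive version and authorization mode, so treat the output as a starting point rather than a definitive configuration.

```python
# Hypothetical mapping of roles to the tables they may read.
TABLE_READERS = {
    "marketing": ["sales"],
    "finance": ["financials"],
}


def build_grants(table_readers):
    """Build one SELECT grant per (role, table) pair, in sorted order."""
    statements = []
    for role, tables in sorted(table_readers.items()):
        for table in sorted(tables):
            statements.append(f"GRANT SELECT ON TABLE {table} TO ROLE {role}")
    return statements


for statement in build_grants(TABLE_READERS):
    print(statement + ";")
```

Generating grants from a single mapping keeps the intended access policy in one reviewable place, which makes it easier to audit who can read which table.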
What’s Next?
As concerns about data privacy and protection escalate, Qubole is dedicated to equipping enterprises with a multi-layer access control solution that is engine- and cloud-agnostic. In the following blog posts, we will discuss the best practices for setting user permissions and showcase the additional granular access-control resources your enterprise can deploy on Qubole’s platform.