Qubole has defined a new security model to improve enterprise-level security and data governance on the cloud. The model integrates the cloud vendor’s storage authorization with Hive authorization. This improves usability for both cloud-storage administrators and Data Administrators (DBA) while eliminating errors that arise from end-user authorization problems.
This is an important milestone on the way to Qubole’s goal of building a secure, enterprise-level cloud platform. Qubole is one of the first vendors to add cloud storage-level checks at query compile time, and consequently offers the most secure platform in the cloud.
History of Database Security Model
In traditional databases (RDBMS and NoSQL), the database had complete control over the catalog, compute, and storage, so administrators and users used the database as the single source of truth for authentication and authorization.
In early versions of Apache Hive, the catalog (Hive Metastore), compute (M/R), and storage (HDFS) were separate. Seamless authorization was not possible because each system had to administer independently. Apache Hive released SQL Standard Based Hive Authorization in Hive 0.13, which integrated all the systems. This mechanism runs storage-level checks through the Metastore server and checks the GRANT tables during query compile. As a prerequisite, a user must be registered with Hive and HDFS, and both must have a single view of users and their credentials. Similarly, Apache Ranger and Apache Sentry both provide HDFS-level file permission checks when the same user is trying to access tables through SQL.
Cloud Security Model
In the case of public clouds, data is stored in the cloud, and not in HDFS. Authorization on cloud storage is managed by the cloud vendor. SQL Standard Based Hive Authorization in Hive does not work in this case because there is no unified view of users and credentials.
Consider the typical representation of a Hive table in the cloud depicted in the diagram above:
A Hive table consists of files in the cloud storage, and catalog information in the Hive Metastore. Consequently, users have to be assigned Roles/Keys for cloud storage as well as USER or ROLE with the Hive database. The authorization modules of the cloud storage and the database are separate, and it is very hard to keep them synchronized. A mapping or coordination is needed to manage the pairs of roles and use them effectively.
Qubole Security Model
Qubole has integrated cloud storage authorization with Hive authorization by introducing a new security model. In QDS, users can choose between two models:
L1: User uses cloud storage permissions to control table access. Users may also combine SQL statements to define access policy and rules, but will mainly use such statements for error-proofing (e.g. in the case of a user accidentally dropping a table).
We’ll discuss the two models in detail below, using AWS S3 storage as an example.
L1: Cloud Storage Authorization(AWS)
If the organization has 2 tables:
Table | User Access | S3 location |
A | User1 | s3://org/datawarehouse/tables/A |
B | User2 | s3://org/datawarehouse/tables/B |
What administrator needs to define in User1 and User2’s IAM policy:
For user1’s IAM role:
... { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ], "Resource": "arn:aws:s3:::org/datawarehouse/tables/A/*" }, ... For user2’s IAM-role: ... { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ], "Resource": "arn:aws:s3:::org/datawarehouse/tables/B/*" }, ...
Some issues with L1:
Fine-Grained Data Access Control | Cannot support because users get access to the entire set of files. |
Storage Administrator Maintenance | Need to create IAM roles per user or user group to define data directory access according to table access. In the case of any directory restructure, need to update the IAM policies. |
Auditing | At file level, which could be verbose (imagine a join across multiple tables, multiple partitions). |
DBA | Either do nothing or redo authorization policies via SQL statement as error-proofing. |
L2: Qubole SQL Authorization
L2’s design principle is to provide Database Administrators with a unified view of database and storage permissions. No coordination with cloud storage administrators is needed.
Qubole Hive implements this design principle by assigning a separate role to the compute nodes (IAM-C). This role provides access to all the data.
Qubole Hive then executes the following checks query compile time:
- GRANT tables in metastore
- Storage-level checks with the user’s IAM role (IAM-A) as needed for location-related DDLs only.
The compute role and query compile-time checks provide a seamless way for database administrators to define authorization policies without needing to make any changes in cloud storage policies.
Advantages of L2 vs. L1:
Fine-Grained Data Access Control | Achievable through Hive Views. |
Storage Administrator Maintenance | No need to define data directory access according to the table access in each IAM role setup. No work for any data directory restructures in the protected data warehouse bucket or folder. |
Auditing | At table level through Hive logs. |
DBA | Using Standard SQL to control user access. |
L2: Limitations
The L2 model covers data access only through metadata policies in QDS.
Qubole recommends customers also secure their cluster access outside of Qubole (e.g. direct access to clusters or storage) to build a completely secure environment.
Use Case
A big organization has three departments: Sales, Marketing, and Finance. These need to have different privileges for access to tables stored in the protected data area.
Tables | |||
Customer | Store_sales | Promotion | |
Sales | Read/Write | Read/Write | Read |
Marketing | Read | Read | Read/Write |
Finance | Read | Read | Read |
Storage Configuration
3 tables are stored in an S3 data warehouse which is a protected area
Protected Data Area:
s3a://datalake/datawarehouse/customer/ (customer table)
s3a://datalake/datawarehouse/store_sales/ (store_sales table)
s3a://datalake/datawarehouse/promotion/ (promotion table)
IAM Role Creation
IAM-Cluster: has full access to s3a://datalake/datawarehouse
IAM-AccountDefault: has access only to s3a://datalake/defloc/
if the admin wants to add a temp folder for everyone: s3a://datalake/temp
Then follow Dual-IAM Role Documentation to set up the Dual-IAM role for the account
QDS Account Setup
All users can belong to the same QDS account. Here’s a user list that the administrator needs to grant permission to.
User Group / Department | User |
Finance | [email protected] |
Marketing | [email protected] |
Sales | [email protected] |
On the Account Settings page in the QDS UI, put IAM-AccountDefault role credentials on the setting page.
HiveQL DDL Privilege Setup
After Hive Authorization is enabled for the account.
Set role admin; Use demo_database; Create role sales; Create role marketing; Create role finance; Grant select, insert on customer to role sales; Grant select on customer to role finance; Grant select on customer to role marketing; Grant select, insert on store_sales to role sales; Grant select on store_sales to role finance; Grant select on store_sales to role marketing; Grant select on promotion to role sales; Grant select on promotion to role finance; Grant select, insert on promotion to role marketing GRANT role finance to USER john; GRANT role sales to USER david; GRANT role marketing to USER mary;
Future Work
Other engines (Spark, Presto, etc.) and applications (e.g. Zeppelin Notebook) will support the L2 Model over the next couple of releases. We’ll write blogs to elaborate on the mechanism and provide use cases for these engines to demonstrate our cross-engine security solution for data authorization.