Introduction
As part of a recent project, I had to experiment with CUBE functionality in Hive. This functionality was added relatively recently (Hive 0.10) and is an advanced use case, so it is difficult to find examples beyond the one in the Hive Wiki. In this post I document some of my experiments in setting up a CUBE on TPC-DS. I hope this is useful for other users who are new to Hive and/or cubes in Hive.
Data Model
For my experiments, I used a 500 GB scale TPC-DS data set, a 10-node Hadoop cluster, and Hive 0.13.1 running on Qubole Data Service (QDS).
My goal was to calculate various measures of store sales. More specifically, I wanted to calculate:
- Total Extended Price
- Total Sales Price
- Total Net Profit
- Total Wholesale Cost
- Total Coupon Amt
- Total List Price
These measures then need to be broken down by many dimensions. For example, we can drill down into the following dimensions (levels in parentheses):
- Date (Year, Quarter, Month, Day)
- Store Information (Store Id)
- Household Demographics (Number of Dependents, Buy Potential)
- Customer Demographics (Gender, Marital Status, Education Status)
- Ad Channel (TV, Event, Email)
- Time (Hour, Minute)
The ER Diagram for the relevant tables is shown below. It’s a classic star schema.
Introduction to Cubes
This example is a typical dimensional data model found in OLAP. The data model describes the measures and the dimensions that make the data useful. Cubes are the physical implementations of dimensional data models. A cube captures the structure in the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.
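Before building a cube on TPC-DS, it helps to see what the CUBE keyword does in Hive: WITH CUBE is shorthand for enumerating every grouping set of the GROUP BY columns. A minimal sketch with placeholder names (the table and columns here are illustrative, not the TPC-DS schema used below):
select d_year, cd_gender, sum(ss_sales_price)
from some_denormalized_sales
group by d_year, cd_gender
with cube;
-- is equivalent to enumerating all subsets of the GROUP BY columns:
select d_year, cd_gender, sum(ss_sales_price)
from some_denormalized_sales
group by d_year, cd_gender
grouping sets ((d_year, cd_gender), (d_year), (cd_gender), ());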
Build the Cube
Preprocess the data
First, I filtered the store_sales table to contain data from 2002 onwards to keep execution times reasonable for my experiments.
-- 2452276 is the surrogate key (d_date_sk) in date_dim for the Jan 1, 2002 row
create table store_sales_2002_plus as
select * from tpcds_orc_500.store_sales
where ss_sold_date_sk >= 2452276;
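For reference, this surrogate key can be looked up in date_dim; a quick sketch, assuming the standard TPC-DS d_date format:
select d_date_sk from date_dim where d_date = '2002-01-01';
-- 2452276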
select count(*) from store_sales_2002_plus;
278035965
Create Cube
I created a cube to store the dimensions and measures I am interested in.
create table store_sales_cube as select sum(ss_ext_sales_price) as sum_extended_price,
sum(ss_sales_price) as sum_sales_price, sum(ss_net_profit) as sum_net_profit,
sum(ss_wholesale_cost) as sum_wholesale_cost, sum(ss_coupon_amt) as sum_coupon_amt,
sum(ss_list_price) as sum_list_price,
d_year, d_qoy, d_moy, d_date, s_store_id, s_store_name,
cd_gender, cd_marital_status, cd_education_status, grouping__id
from store_sales_2002_plus join item on ss_item_sk = i_item_sk
join customer on ss_customer_sk = c_customer_sk
join date_dim on ss_sold_date_sk = d_date_sk
join customer_demographics on ss_cdemo_sk = cd_demo_sk
join promotion on ss_promo_sk = p_promo_sk
join household_demographics on ss_hdemo_sk = hd_demo_sk
join store on ss_store_sk = s_store_sk
join time_dim on ss_sold_time_sk = t_time_sk
group by d_year, d_qoy, d_moy, d_date, s_store_id, s_store_name,
cd_gender, cd_marital_status, cd_education_status
with cube;
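One execution note: with nine GROUP BY columns, CUBE produces 2^9 = 512 grouping sets. Hive can launch an additional MapReduce job when the number of grouping sets exceeds hive.new.job.grouping.set.cardinality (its default is 30), so that is the setting to look at if the cube job misbehaves; it is shown below with its default value purely for reference.
-- default shown; raise it only if you understand the map-side expansion trade-off
set hive.new.job.grouping.set.cardinality = 30;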
select count(*) from store_sales_cube;
1586304
The above query generates aggregates for all possible combinations of the GROUP BY columns.
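As a quick sanity check that all the grouping sets landed in the materialized cube, the distinct grouping__id values can be counted; with nine dimensions this should come out to 2^9 = 512, assuming every grouping set produced at least one row:
select count(distinct `grouping__id`) from store_sales_cube;
-- expect 512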
The schema of store_sales_cube is:
| Column | Data Type |
| --- | --- |
| sum_extended_price | double |
| sum_sales_price | double |
| sum_net_profit | double |
| sum_wholesale_cost | double |
| sum_coupon_amt | double |
| sum_list_price | double |
| d_year | int |
| d_qoy | int |
| d_moy | int |
| d_date | timestamp |
| s_store_id | string |
| s_store_name | string |
| cd_gender | string |
| cd_marital_status | string |
| cd_education_status | string |
| grouping__id | string |
A few example rows are shown below.
| d_year | d_qoy | d_moy | d_date | s_store_id | s_store_name | cd_gender | cd_marital_status | cd_education_status | grouping__id | Total Sales Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2002 | 3 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 3 | 2.712...E9 |
| NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 1.00...E10 |
| NULL | NULL | NULL | NULL | NULL | NULL | M | NULL | NULL | 64 | 5.01..E9 |
The first row stores the measures for d_year=2002, d_qoy=3 only. It is one of the rows in the result of:
select d_year, d_qoy, sum(ss_sales_price) -- plus the other aggregates
from store_sales_2002_plus
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;
The second row stores the measures for the complete data set.
The third row stores the measures for cd_gender=M only. It is one of the rows in the result of:
select cd_gender, sum(ss_sales_price) -- plus the other aggregates
from store_sales_2002_plus
join customer_demographics on ss_cdemo_sk = cd_demo_sk
group by cd_gender;
Grouping ID
Let’s say an analyst is interested in finding sum_sales_price by gender (cd_gender). How does the analyst find the rows that store the measures for cd_gender?
GROUPING__ID is useful for selecting rows based on the dimensions of interest. It is a column generated by Hive when the CUBE keyword is used; I included it in the select list so that it is materialized in the cube table for subsequent queries. GROUPING__ID is a bit vector over the dimensions of the cube, stored as a base-10 integer. Its bits correspond to the GROUP BY columns read from right to left, so the first GROUP BY column (d_year) is the least significant bit. In Hive 0.13, a bit is set to 1 for each dimension that is present (not aggregated away) in a row.
For rows that hold the measures broken down by cd_gender alone, the bit vector is 001000000. The table below has a couple more examples.
| Group By Columns | grouping__id | cd_education_status | cd_marital_status | cd_gender | s_store_name | s_store_id | d_date | d_moy | d_qoy | d_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cd_gender | 64 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| d_year, d_qoy | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| d_year, d_moy | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| d_year, d_qoy, cd_marital_status | 131 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
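If you want to see which grouping sets are actually stored in the cube, the grouping__id values can be decoded back into their bit vectors with conv and lpad; this is an illustrative query, not part of the original experiment:
select distinct `grouping__id`,
       lpad(conv(`grouping__id`, 10, 2), 9, '0') as dimension_bits
from store_sales_cube
limit 10;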
Let’s look at an example of rows for the cd_gender dimension.
select cd_gender, sum_sales_price
from store_sales_cube where
`grouping__id` = conv("001000000", 2, 10);
| cd_gender | sum_sales_price |
| --- | --- |
| M | 5.017321159230397E9 |
| F | 5.01904028465792E9 |
conv is a Hive function that converts a number, given as a string, from one base to another; its arguments are the string, the base of the input (here 2), and the base of the result (here 10).
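For example, converting the cd_gender bit vector by hand:
select conv("001000000", 2, 10);
-- 64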
Total Sales Price for each quarter
Let us look at queries on the raw data and on the cube to calculate the same measures. I will use bytes read as a proxy for how much work each query does.
Query on raw data:
select d_year, d_qoy, sum(ss_sales_price)
from store_sales_2002_plus
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;
Bytes Read: 776,170,833
Query on cube:
select d_year, d_qoy, sum_sales_price
from store_sales_cube where
`grouping__id` = conv("000000011", 2, 10);
Bytes Read: 11,783,322
The query on the cube read about 1.5% of the bytes read by the query on the raw tables.
Total Sales for each quarter to married customers
Let's look at another example. The following query filters the results by another dimension, cd_marital_status:
select d_year, d_qoy, cd_marital_status, sum_sales_price
from store_sales_cube where
`grouping__id` = conv("010000011", 2, 10) and cd_marital_status = "M";
GROUPING Functions in other databases
Grouping functions are important for choosing the right cells in a cube. Other databases have similar functions, e.g. GROUPING_ID in SQL Server.
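As a rough sketch of the analogue, here is what the quarterly query might look like in T-SQL (illustrative only; note that SQL Server's GROUPING_ID sets a bit to 1 when a column is aggregated away, the opposite of the Hive 0.13 convention described above):
SELECT d_year, d_qoy, SUM(ss_sales_price) AS sum_sales_price,
       GROUPING_ID(d_year, d_qoy) AS grp
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
GROUP BY CUBE (d_year, d_qoy);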
Summary
In summary, we looked at an example of multidimensional data generated with the CUBE keyword in Apache Hive, and saw how to use GROUPING__ID to select the right cells in a cube.