CUBE Keyword in Apache Hive

Start Free Trial
June 19, 2015 by Updated May 2nd, 2024

Introduction

As part of a recent project – I had to experiment with CUBE functionality in Hive. This functionality was added somewhat recently to Hive (version 0.10) and is an advanced use case in Hive. Perhaps for these reasons – it is difficult to find examples other than the one in the Hive Wiki. In this post – I am documenting some of my experiments in setting up a CUBE on TPC-DS. I hope that this is useful for other users new to Hive and/or Cubes in Hive.

Data Model

For my experiments, I used a 500 GB scale TPCDS data set, a 10-node Hadoop cluster, and Hive 0.13.1 running on Qubole Data Service (QDS).

My goal was to calculate various measures of store sales. More specifically, I wanted to calculate:

  • Total Extended Price
  • Total Sales Price
  • Total Net Profit
  • Total Wholesale Cost
  • Total Coupon Amt
  • Total List Price

These measures then need to be broken into many dimensions. For example, we can drill down into the following dimensions (Levels in parentheses):

  • Date (Year, Quarter, Month, Day)
  • Store Information (Store Id)
  • Household Demographics (Number of Dependents, Buy Potential)
  • Customer Demographics (Gender, Marital Status, Education Status)
  • Ad Channel (TV, Event, Email)
  • Time (Hour, Minute)

The ER Diagram for the relevant tables is shown below. It’s a classic star schema.

Cube-Rollup-Example-crop-redux

Introduction to Cubes

This example is a typical dimensional data model found in OLAP. The data model describes the measures and the dimensions that make the data useful. Cubes are the physical implementations of dimensional data models. A cube captures the structure in the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.

Build the Cube

Preprocess the data
First, I filtered the store_sales table to contain data from 2002 onwards to keep execution times reasonable for my experiments.


# 2452276 is the id in date_dim for the row of Jan 1 2002
create table store_sales_2002_plus as select * from tpcds_orc_500.store_sales where ss_sold_date_sk >= 2452276

select count(*) from store_sales_2002_plus;
278035965

Create Cube

I created a cube to store the dimensions and measures I am interested in.


create table store_sales_cube as select sum(ss_ext_sales_price) as sum_extended_price, 
             sum(ss_sales_price) as sum_sales_price, sum(ss_net_profit) as sum_net_profit,
             sum(ss_wholesale_cost) as sum_wholesale_cost, sum(ss_coupon_amt) as sum_coupon_amt,
             sum(ss_list_price) as sum_list_price, 
             d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status, grouping__id
      from store_sales_2002_plus join item on ss_item_sk = i_item_sk 
             join customer on ss_customer_sk = c_customer_sk 
             join date_dim on ss_sold_date_sk = d_date_sk 
             join customer_demographics on ss_cdemo_sk = cd_demo_sk 
             join promotion on ss_promo_sk = p_promo_sk 
             join household_demographics on ss_hdemo_sk = hd_demo_sk 
             join store on ss_store_sk = s_store_sk 
             join time_dim on ss_sold_time_sk = t_time_sk
        group by d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status
        with cube;

select count(*) from store_sales_cube;
1586304

The above query generates aggregates for all possible combinations of groups by columns.
The schema of store_sales_cube is:

ColumnData Type
sum_extended_pricedouble
sum_sales_pricedouble
sum_net_profitdouble
sum_wholesale_costdouble
sum_coupon_amtdouble
sum_list_pricedouble
d_yearint
d_qoyint
d_moyint
d_datetimestamp
s_store_idstring
cd_genderstring
cd_marital_statusstring
cd_education_statusstring
grouping__idstring

A few example rows are shown below.

d_yeard_qoyd_moyd_dates_store_ids_store_namecd_gendercd_marital_statuscd_educational_statusgrouping__idTotal Sales Price
20023NULLNULLNULLNULLNULLNULLNULL32.712...E9
NULLNULLNULLNULLNULLNULLNULLNULLNULL01.00...E10
NULLNULLNULLNULLNULLNULLMNULLNULL645.01..E9

The first-row stores’ measures for d_year=2002, d_qoy=3 only. It is one of the rows in the result of


select d_year, d_qoy, sum(ss_sales_price), other aggregates
from store_sales_2002_plus
join date_dim on s_sold_date_sk = d_date_sk
group by d_year, d_qoy;

The second-row stores the measures for the complete data set.

The third-row stores the measures for cd_gender=M only. It is one of the rows in the result of


select cd_gender, sum(ss_sales_price), other aggregates
from store_sales_2002_plus
join customer_demographics on ss_cdemo_sk = cd_demo_sk 
group by cd_gender;

Grouping ID

Let’s say an analyst is interested in finding sum_sales_price by gender (cd_gender). How does the analyst find the rows that store the measures for cd_gender?
grouping_id is useful to select rows based on the dimensions of interest. grouping_id is a column generated by Hive when the CUBE keyword is used. I specified it in the project list to use in subsequent queries. Grouping ID is a bit vector of the dimensions in a cube and is stored as a base10 integer. It is generated by listing the dimensions from right to left in the same order as the group by column in the cube to create SQL. Bit 1 is assigned to the dimension that occurs in a row.
For rows that have measures for cd_gender, the bit vector is 001000000. The table below has a couple of more examples.

Group By Columnsgrouping_idcd_educational_statuscd_marital_statuscd_genders_store_names_store_idd_dated_moyd_qoyd_year
cd_gender64001000000
d_year - d_qoy3000000011
d_year - d_moy5000000101
d_year - d_qoy - cd_marital_status131010000011

Let’s look at an example of rows for the cd_gender dimension.


select cd_gender, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("001000000", 2, 10);

cd_gendertotal_sales_price
M5.017321159230397E9
F5.01904028465792E9

Conv is a Hive function to convert a number (specified in a string) in a specified base (in this case 2) to an integer in another base (in this case 10). It takes the string, the base of the number in the string, and the base of the result as arguments.

Total Sales Price for each quarter

Let us look at queries on the raw data and cube to calculate the measures. I will use the number of rows read as a measure of speed.


Query on raw data:
select d_year, d_qoy, sum(ss_sales_price)
from store_sales_2002_plus
join date_dim on s_sold_date_sk = d_date_sk
group by d_year, d_qoy;

Bytes Read: 776,170,833

Query on cube:
select d_year, d_qoy, sum(sum_sales_price)
   from store_sales_cube where 
   `grouping__id` = conv("000000011", 2, 10);
Bytes Read: 11,783,322

Query on the cube scanned 1.5% of the data compared to the query on raw data tables.

Total Sales for each quarter to married customers

Let’s look at another example. The following query filters the results by another dimension – cd_marital_status


select d_year, d_qoy, cd_marital_status, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("010000011", 2, 10) and cd_marital_status = "M";

GROUPING Functions in other databases

Grouping functions is important to choose the right cells in a cube. Other databases have similar functions. e.g. GROUPING_ID in SQL Server.

Summary

In summary, we looked at an example of multidimensional data generated by the CUBE keyword in Apache Hive. We also understood how to use GROUPING_ID to select the right cells in a cube.

Start Free Trial
Read Hadoop is Hard! But Big Data Doesn’t Have To Be