Creating Bucketed and Sorted Table in Hive and Inserting Data

Create Table

A bucketed and sorted table stores the data in different buckets and the data in each bucket is sorted according to the column specified in the SORTED BY clause while creating the table. For creating a bucketed and sorted table, we need to use CLUSTERED BY (columns) SORTED BY (columns) to define the columns for bucketing, sorting and provide the number of buckets. Following query creates a table Employee bucketed using the ID column into 5 buckets and each bucket is sorted on AGE.

CREATE TABLE Employee(
ID BIGINT,
NAME STRING, 
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING 
)
COMMENT 'This is Employee table clustered by id sorted by age into 5 buckets'
CLUSTERED BY(ID) SORTED BY(AGE)INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Inserting Data

We have following data in Employee_old table.

0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id  | employee_old.name  | employee_old.age  | employee_old.salary  | employee_old.department  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1                | Sudip              | 34                | 62000                | HR                       |
| 2                | Suresh             | 45                | 76000                | FINANCE                  |
| 3                | Aarti              | 25                | 37000                | BIGDATA                  |
| 4                | Neha               | 27                | 39000                | FINANCE                  |
| 5                | Rajesh             | 29                | 59000                | BIGDATA                  |
| 6                | Suman              | 37                | 63000                | HR                       |
| 7                | Paresh             | 42                | 71000                | BIGDATA                  |
| 8                | Rami               | 33                | 56000                | HR                       |
| 9                | Arpit              | 41                | 46000                | HR                       |
| 10               | Sanjeev            | 51                | 99000                | FINANCE                  |
| 11               | Sanjay             | 32                | 67000                | FINANCE                  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+

We will select data from the table Employee_old and insert it into our bucketed table Employee.

We need to set the property ‘hive.enforce.bucketing‘ to true while inserting data into a bucketed table. This will enforce bucketing, while inserting data into the table. We also need to set the property ‘hive.enforce.sorting‘ to true, this will enforce sorting while inserting data into each bucket.

#Set the property to enforce Bucketing
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true;

#Set the property to enforce Sorting
0: jdbc:hive2://localhost:10000> set hive.enforce.sorting=true;

#Insert data into Bucketed table employee
0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;

Verify the Data in Buckets

Once we execute the INSERT query, we can verify that 5 files are created under the Employee table directory on HDFS.

Name        Type
000000_0    file
000001_0    file
000002_0    file
000003_0    file
000004_0    file

Each file represents a bucket. Let us see the contents of these files.

Content of 000000_0

All records with Hash(ID) mod 5 == 0 goes into this file. All records are sorted on Age column.

5,Rajesh,29,59000,BIGDATA
10,Sanjeev,51,99000,FINANCE

Content of 000001_0

All records with Hash(ID) mod 5 == 1 goes into this file. All records are sorted on Age column.

11,Sanjay,32,67000,FINANCE
1,Sudip,34,62000,HR
6,Suman,37,63000,HR

Content of 000002_0

All records with Hash(ID) mod 5 == 2 goes into this file. All records are sorted on Age column.

7,Paresh,42,71000,BIGDATA
2,Suresh,45,76000,FINANCE

Content of 000003_0

All records with Hash(ID) mod 5 == 3 goes into this file. All records are sorted on Age column.

3,Aarti,25,37000,BIGDATA
8,Rami,33,56000,HR

Content of 000004_0

All records with Hash(ID) mod 5 == 4 goes into this file. All records are sorted on Age column.

4,Neha,27,39000,FINANCE
9,Arpit,41,46000,HR

Leave a Reply

Your email address will not be published. Required fields are marked *