Creating Bucketed Table in Hive and Inserting Data

Create Table

For creating a bucketed table, we need to use CLUSTERED BY clause to define the columns for bucketing and provide the number of buckets. Following query creates a table Employee bucketed using the ID column into 5 buckets.

CREATE TABLE Employee(
ID BIGINT,
NAME STRING, 
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING 
)
COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets'
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Inserting Data

We have following data in Employee_old table.

0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id  | employee_old.name  | employee_old.age  | employee_old.salary  | employee_old.department  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1                | Sudip              | 34                | 62000                | HR                       |
| 2                | Suresh             | 45                | 76000                | FINANCE                  |
| 3                | Aarti              | 25                | 37000                | BIGDATA                  |
| 4                | Neha               | 27                | 39000                | FINANCE                  |
| 5                | Rajesh             | 29                | 59000                | BIGDATA                  |
| 6                | Suman              | 37                | 63000                | HR                       |
| 7                | Paresh             | 42                | 71000                | BIGDATA                  |
| 8                | Rami               | 33                | 56000                | HR                       |
| 9                | Arpit              | 41                | 46000                | HR                       |
| 10               | Sanjeev            | 51                | 99000                | FINANCE                  |
| 11               | Sanjay             | 32                | 67000                | FINANCE                  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+

We will select data from the table Employee_old and insert it into our bucketed table Employee.

We need to set the property ‘hive.enforce.bucketing‘ to true while inserting data into a bucketed table. This will enforce bucketing, while inserting data into the table.

#Set the property
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true;

#Insert data into Bucketed table employee
0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;

Verify the Data in Buckets

Once we execute the INSERT query, we can verify that 5 files are created under the Employee table directory on HDFS.

Name        Type
000000_0    file
000001_0    file
000002_0    file
000003_0    file
000004_0    file

Each file represents a bucket. Let us see the contents of these files.

Content of 000000_0

All records with Hash(ID) mod 5 == 0 goes into this file.

5,Rajesh,29,59000,BIGDATA
10,Sanjeev,51,99000,FINANCE

Content of 000001_0

All records with Hash(ID) mod 5 == 1 goes into this file.

1,Sudip,34,62000,HR
6,Suman,37,63000,HR
11,Sanjay,32,67000,FINANCE

Content of 000002_0

All records with Hash(ID) mod 5 == 2 goes into this file.

2,Suresh,45,76000,FINANCE
7,Paresh,42,71000,BIGDATA

Content of 000003_0

All records with Hash(ID) mod 5 == 3 goes into this file.

3,Aarti,25,37000,BIGDATA
8,Rami,33,56000,HR

Content of 000004_0

All records with Hash(ID) mod 5 == 4 goes into this file.

4,Neha,27,39000,FINANCE
9,Arpit,41,46000,HR

 

Leave a Reply

Your email address will not be published. Required fields are marked *