Create Table
For creating a bucketed table, we need to use CLUSTERED BY clause to define the columns for bucketing and provide the number of buckets. Following query creates a table Employee bucketed using the ID column into 5 buckets.
CREATE TABLE Employee( ID BIGINT, NAME STRING, AGE INT, SALARY BIGINT, DEPARTMENT STRING ) COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets' CLUSTERED BY(ID) INTO 5 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Inserting Data
We have following data in Employee_old table.
0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id | employee_old.name | employee_old.age | employee_old.salary | employee_old.department |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1 | Sudip | 34 | 62000 | HR |
| 2 | Suresh | 45 | 76000 | FINANCE |
| 3 | Aarti | 25 | 37000 | BIGDATA |
| 4 | Neha | 27 | 39000 | FINANCE |
| 5 | Rajesh | 29 | 59000 | BIGDATA |
| 6 | Suman | 37 | 63000 | HR |
| 7 | Paresh | 42 | 71000 | BIGDATA |
| 8 | Rami | 33 | 56000 | HR |
| 9 | Arpit | 41 | 46000 | HR |
| 10 | Sanjeev | 51 | 99000 | FINANCE |
| 11 | Sanjay | 32 | 67000 | FINANCE |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
We will select data from the table Employee_old and insert it into our bucketed table Employee.
We need to set the property ‘hive.enforce.bucketing‘ to true while inserting data into a bucketed table. This will enforce bucketing, while inserting data into the table.
#Set the property 0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true; #Insert data into Bucketed table employee 0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;
Verify the Data in Buckets
Once we execute the INSERT query, we can verify that 5 files are created under the Employee table directory on HDFS.
Name Type 000000_0 file 000001_0 file 000002_0 file 000003_0 file 000004_0 file
Each file represents a bucket. Let us see the contents of these files.
Content of 000000_0
All records with Hash(ID) mod 5 == 0 goes into this file.
5,Rajesh,29,59000,BIGDATA 10,Sanjeev,51,99000,FINANCE
Content of 000001_0
All records with Hash(ID) mod 5 == 1 goes into this file.
1,Sudip,34,62000,HR 6,Suman,37,63000,HR 11,Sanjay,32,67000,FINANCE
Content of 000002_0
All records with Hash(ID) mod 5 == 2 goes into this file.
2,Suresh,45,76000,FINANCE 7,Paresh,42,71000,BIGDATA
Content of 000003_0
All records with Hash(ID) mod 5 == 3 goes into this file.
3,Aarti,25,37000,BIGDATA 8,Rami,33,56000,HR
Content of 000004_0
All records with Hash(ID) mod 5 == 4 goes into this file.
4,Neha,27,39000,FINANCE 9,Arpit,41,46000,HR