Create Table
A bucketed and sorted table stores the data in different buckets and the data in each bucket is sorted according to the column specified in the SORTED BY clause while creating the table. For creating a bucketed and sorted table, we need to use CLUSTERED BY (columns) SORTED BY (columns) to define the columns for bucketing, sorting and provide the number of buckets. Following query creates a table Employee bucketed using the ID column into 5 buckets and each bucket is sorted on AGE.
CREATE TABLE Employee( ID BIGINT, NAME STRING, AGE INT, SALARY BIGINT, DEPARTMENT STRING ) COMMENT 'This is Employee table clustered by id sorted by age into 5 buckets' CLUSTERED BY(ID) SORTED BY(AGE)INTO 5 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Inserting Data
We have following data in Employee_old table.
0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id | employee_old.name | employee_old.age | employee_old.salary | employee_old.department |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1 | Sudip | 34 | 62000 | HR |
| 2 | Suresh | 45 | 76000 | FINANCE |
| 3 | Aarti | 25 | 37000 | BIGDATA |
| 4 | Neha | 27 | 39000 | FINANCE |
| 5 | Rajesh | 29 | 59000 | BIGDATA |
| 6 | Suman | 37 | 63000 | HR |
| 7 | Paresh | 42 | 71000 | BIGDATA |
| 8 | Rami | 33 | 56000 | HR |
| 9 | Arpit | 41 | 46000 | HR |
| 10 | Sanjeev | 51 | 99000 | FINANCE |
| 11 | Sanjay | 32 | 67000 | FINANCE |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
We will select data from the table Employee_old and insert it into our bucketed table Employee.
We need to set the property ‘hive.enforce.bucketing‘ to true while inserting data into a bucketed table. This will enforce bucketing, while inserting data into the table. We also need to set the property ‘hive.enforce.sorting‘ to true, this will enforce sorting while inserting data into each bucket.
#Set the property to enforce Bucketing 0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true; #Set the property to enforce Sorting 0: jdbc:hive2://localhost:10000> set hive.enforce.sorting=true; #Insert data into Bucketed table employee 0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;
Verify the Data in Buckets
Once we execute the INSERT query, we can verify that 5 files are created under the Employee table directory on HDFS.
Name Type 000000_0 file 000001_0 file 000002_0 file 000003_0 file 000004_0 file
Each file represents a bucket. Let us see the contents of these files.
Content of 000000_0
All records with Hash(ID) mod 5 == 0 goes into this file. All records are sorted on Age column.
5,Rajesh,29,59000,BIGDATA 10,Sanjeev,51,99000,FINANCE
Content of 000001_0
All records with Hash(ID) mod 5 == 1 goes into this file. All records are sorted on Age column.
11,Sanjay,32,67000,FINANCE 1,Sudip,34,62000,HR 6,Suman,37,63000,HR
Content of 000002_0
All records with Hash(ID) mod 5 == 2 goes into this file. All records are sorted on Age column.
7,Paresh,42,71000,BIGDATA 2,Suresh,45,76000,FINANCE
Content of 000003_0
All records with Hash(ID) mod 5 == 3 goes into this file. All records are sorted on Age column.
3,Aarti,25,37000,BIGDATA 8,Rami,33,56000,HR
Content of 000004_0
All records with Hash(ID) mod 5 == 4 goes into this file. All records are sorted on Age column.
4,Neha,27,39000,FINANCE 9,Arpit,41,46000,HR