We will see different ways for inserting data using Dynamic partitioning into a Partitioned Hive table.
To know how to create partitioned tables in Hive, go through the following links:-
Creating Partitioned Hive table and importing data
Creating Hive Table Partitioned by Multiple Columns and Importing Data
Dynamic Partitioning
Dynamic partitioning is used when the values for partition columns are not known when loading data into a Hive table.
Some cases where we need to use Dynamic partitioning to load data into hive Tables :-
- The values for partition columns are not known in advance. There may be cases where new partitions are getting added based on value of partition columns in the new data.
- The input data is not segregated partition-wise and the user may not want to segregate the input data per partition and then load it into each partition using static partitioning.
While inserting data using dynamic partitioning into a partitioned Hive table, the partition columns must be specified at the end in the ‘SELECT’ query. This is required for Hive to detect the values of partition columns from the data automatically. The order of partitioned columns should be the same as specified while creating the table.
Hive internally generates Map Reduce job to segregate data based on value of partition columns and then loads that data into their respective partitions.
Partitioned Hive Table
We have a table Employee in Hive partitioned by department:-
0: jdbc:hive2://localhost:10000> desc Employee;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| id | bigint | |
| name | string | |
| age | int | |
| salary | bigint | |
| department | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| department | string | |
+--------------------------+-----------------------+-----------------------+--+
1. Insert Data into Hive table Partitions from Queries
We can load result of a query into Hive table partitions using dynamic partitioning.
Suppose we have another non-partitioned table Employee_old, which store data for employees along-with their departments.
0: jdbc:hive2://localhost:10000> desc Employee_old; +-------------+------------+----------+--+ | col_name | data_type | comment | +-------------+------------+----------+--+ | id | bigint | | | name | string | | | age | int | | | salary | bigint | | | department | string | | +-------------+------------+----------+--+ 0: jdbc:hive2://localhost:10000> select * from Employee_old; +------------------+--------------------+-------------------+----------------------+--------------------------+--+ | employee_old.id | employee_old.name | employee_old.age | employee_old.salary | employee_old.department | +------------------+--------------------+-------------------+----------------------+--------------------------+--+ | 60001 | Sudip | 34 | 62000 | HR | | 70001 | Suresh | 45 | 76000 | FINANCE | | 80001 | Aarti | 25 | 37000 | BIGDATA | | 80002 | Neha | 27 | 39000 | FINANCE | | 80003 | Rajesh | 29 | 59000 | BIGDATA | +------------------+--------------------+-------------------+----------------------+--------------------------+--+
We will select data from Employee_old table and will insert that data using dynamic partitioning into the Employee table.
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
0: jdbc:hive2://localhost:10000> set hive.exec.dynamic.partition.mode=nonstrict;
To insert data into the partitions of table Employee using a select query on the table Employee_old :-
#Overwrite data from result of a select query into the table partition INSERT OVERWRITE TABLE Employee PARTITION (department) SELECT id, name, age, salary, department from Employee_old; #Append data from result of a select query into the table partition INSERT INTO TABLE Employee PARTITION (department) SELECT id, name, age, salary, department from Employee_old;
Once data is loaded in the table partitions, we can see that Hive has created three directories under the Employee table directory on HDFS – /user/hive/warehouse/employee. Data for each department will reside under its own directory.
Name Type department=BIGDATA dir department=FINANCE dir department=HR dir
Verify Data :-
0: jdbc:hive2://localhost:10000> select * from Employee where department='HR'; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 60001 | Sudip | 34 | 62000 | HR | +--------------+----------------+---------------+------------------+----------------------+--+ 0: jdbc:hive2://localhost:10000> select * from Employee where department='BIGDATA'; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 80001 | Aarti | 25 | 37000 | BIGDATA | | 80003 | Rajesh | 29 | 59000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+ 0: jdbc:hive2://localhost:10000> select * from Employee where department='FINANCE'; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 70001 | Suresh | 45 | 76000 | FINANCE | | 80002 | Neha | 27 | 39000 | FINANCE | +--------------+----------------+---------------+------------------+----------------------+--+
2. Directly insert values
We can directly insert rows into Hive table partitions using Dynamic partitioning. Hive actually dumps the rows into a temporary file per partition and then loads that file into the respective Hive table partition.
#Insert a single row in a table partition INSERT INTO table Employee PARTITION (department) values(50000, 'Rakesh', 28, 57000,'HR'); #Insert Multiple rows in a table partition INSERT INTO table Employee PARTITION (department) values(60001, 'Sudip', 34, 62000, 'HR'),(70001, 'Suresh', 45, 76000, 'BIGDATA');
Pingback: Partitioning in Hive – My IT Learnings