We will see different ways for inserting data using static partitioning into a Partitioned Hive table.
To know how to create partitioned tables in Hive, go through the following links:-
Creating Partitioned Hive table and importing data
Creating Hive Table Partitioned by Multiple Columns and Importing Data
Static Partitioning
Static partitioning is used when the values for partition columns are known when loading data into a Hive table. The data is assumed to be available partition-wise and then this data is loaded into their respective partitions.
Partitioned Hive Table
We have a table Employee in Hive partitioned by department:-
0: jdbc:hive2://localhost:10000> desc Employee;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| id | bigint | |
| name | string | |
| age | int | |
| salary | bigint | |
| department | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| department | string | |
+--------------------------+-----------------------+-----------------------+--+
1. Load Data into Table Partitions from File/Directory
We can load data into a Hive table partition directly from a file OR from a directory(all the files in the directory will be loaded in the Hive table partition).
Input Files :-
Suppose we have 2 departments – HR and BIGDATA.
On our HDFS, we have records for employees from HR department in the file ‘/home/hadoop/hr_employees.csv‘ and records for employees from BIGDATA department in the file ‘/home/hadoop/bigdata_employees.csv‘.
Contents of ‘/home/hadoop/hr_employees.csv‘ :-
1,aarti,28,30000 2,sakshi,22,20000 3,mahesh,25,25000
Contents of ‘/home/hadoop/bigdata_employees.csv‘ :-
10001,rajesh,29,50000 10002,rahul,23,250000 10003,dinesh,35,70000
Load Data :-
Now to import data for employees into their respective partitions in the Hive table, run following queries.
#Load employees data for partition having department as HR LOAD DATA INPATH '/home/hadoop/hr_employees.csv' INTO TABLE Employee PARTITION (department='HR'); #Load employees data for partition having department as BIGDATA LOAD DATA INPATH '/home/hadoop/bigdata_employees.csv' INTO TABLE Employee PARTITION (department='BIGDATA');
Once the data is loaded in the table partitions, we can see that Hive has created two directories under the Employee table directory on HDFS – /user/hive/warehouse/employee.
Name Type department=BIGDATA dir department=HR dir
2. Insert Data into Hive table Partitions from Queries
We can load result of a query into a Hive table partition.
Suppose we have another non-partitioned table Employee_old, which store data for employees along-with their departments.
0: jdbc:hive2://localhost:10000> desc Employee_old; +-------------+------------+----------+--+ | col_name | data_type | comment | +-------------+------------+----------+--+ | id | bigint | | | name | string | | | age | int | | | salary | bigint | | | department | string | | +-------------+------------+----------+--+ 0: jdbc:hive2://localhost:10000> select * from Employee_old; +------------------+--------------------+-------------------+----------------------+--------------------------+--+ | employee_old.id | employee_old.name | employee_old.age | employee_old.salary | employee_old.department | +------------------+--------------------+-------------------+----------------------+--------------------------+--+ | 60001 | Sudip | 34 | 62000 | HR | | 70001 | Suresh | 45 | 76000 | FINANCE | | 80001 | Aarti | 25 | 37000 | BIGDATA | | 80002 | Neha | 27 | 39000 | FINANCE | | 80003 | Rajesh | 29 | 59000 | BIGDATA | +------------------+--------------------+-------------------+----------------------+--------------------------+--+
We will select data from Employee_old table for a given department and will insert that data into the respective partition of Employee table. To insert data into the partitions of table Employee using a select query on the table Employee_old, use the following queries:-
#Overwrite data from result of a select query into the table partition INSERT OVERWRITE TABLE Employee PARTITION (department='HR') SELECT id, name, age, salary from Employee_old where department='HR'; INSERT OVERWRITE TABLE Employee PARTITION (department='BIGDATA') SELECT id, name, age, salary from Employee_old where department='BIGDATA'; #Append data from result of a select query into the table partition INSERT INTO TABLE Employee PARTITION (department='HR') SELECT id, name, age, salary from Employee_old; where department='HR'; INSERT INTO TABLE Employee PARTITION (department='BIGDATA') SELECT id, name, age, salary from Employee_old; where department='BIGDATA';
3. Directly insert values
We can directly insert rows into a Hive table partition. In this case Hive actually dumps the rows into a temporary file and then loads that file into the Hive table partition.
#Insert a single row in a table partition INSERT INTO table Employee PARTITION (department='HR') values(50000, 'Rakesh', 28, 57000); #Insert Multiple rows in a table partition INSERT INTO table Employee PARTITION (department='BIGDATA') values(60001, 'Sudip', 34, 62000),(70001, 'Suresh', 45, 76000);
Pingback: Partitioning in Hive – My IT Learnings