Different Approaches for Inserting Data Using Static Partitioning into a Partitioned Hive Table

We will see different ways for inserting data using static partitioning into a Partitioned Hive table.

To know how to create partitioned tables in Hive, go through the following links:-

Creating Partitioned Hive table and importing data
Creating Hive Table Partitioned by Multiple Columns and Importing Data

Static Partitioning

Static partitioning is used when the values for partition columns are known when loading data into a Hive table. The data is assumed to be available partition-wise and then this data is loaded into their respective partitions.

Partitioned Hive Table

We have a table Employee in Hive partitioned by department:-

0: jdbc:hive2://localhost:10000> desc Employee;
+--------------------------+-----------------------+-----------------------+--+
|         col_name         |       data_type       |        comment        |
+--------------------------+-----------------------+-----------------------+--+
| id                       | bigint                |                       |
| name                     | string                |                       |
| age                      | int                   |                       |
| salary                   | bigint                |                       |
| department               | string                |                       |
|                          | NULL                  | NULL                  |
| # Partition Information  | NULL                  | NULL                  |
| # col_name               | data_type             | comment               |
|                          | NULL                  | NULL                  |
| department               | string                |                       |
+--------------------------+-----------------------+-----------------------+--+

1. Load Data into Table Partitions from File/Directory

We can load data into a Hive table partition directly from a file OR from a directory(all the files in the directory will be loaded in the Hive table partition).

Input Files :-

Suppose we have 2 departments – HR and BIGDATA.

On our HDFS, we have records for employees from HR department in the file ‘/home/hadoop/hr_employees.csv‘ and records for employees from BIGDATA department in the file ‘/home/hadoop/bigdata_employees.csv‘.

Contents of ‘/home/hadoop/hr_employees.csv‘ :-

1,aarti,28,30000
2,sakshi,22,20000
3,mahesh,25,25000

Contents of ‘/home/hadoop/bigdata_employees.csv‘ :-

10001,rajesh,29,50000
10002,rahul,23,250000
10003,dinesh,35,70000

Load Data :-

Now to import data for employees into their respective partitions in the Hive table, run following queries.

#Load employees data for partition having department as HR
LOAD DATA INPATH '/home/hadoop/hr_employees.csv' 
INTO TABLE Employee PARTITION (department='HR');

#Load employees data for partition having department as BIGDATA
LOAD DATA INPATH '/home/hadoop/bigdata_employees.csv' 
INTO TABLE Employee PARTITION (department='BIGDATA');

Once the data is loaded in the table partitions, we can see that Hive has created two directories under the Employee table directory on HDFS – /user/hive/warehouse/employee.

Name                 Type
department=BIGDATA   dir
department=HR        dir

 

2. Insert Data into Hive table Partitions from Queries

We can load result of a query into a Hive table partition.

Suppose we have another non-partitioned table Employee_old, which store data for employees along-with their departments.

0: jdbc:hive2://localhost:10000> desc Employee_old;
+-------------+------------+----------+--+
|  col_name   | data_type  | comment  |
+-------------+------------+----------+--+
| id          | bigint     |          |
| name        | string     |          |
| age         | int        |          |
| salary      | bigint     |          |
| department  | string     |          |
+-------------+------------+----------+--+

0: jdbc:hive2://localhost:10000> select * from Employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id  | employee_old.name  | employee_old.age  | employee_old.salary  | employee_old.department  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 60001            | Sudip              | 34                | 62000                | HR                       |
| 70001            | Suresh             | 45                | 76000                | FINANCE                  |
| 80001            | Aarti              | 25                | 37000                | BIGDATA                  |
| 80002            | Neha               | 27                | 39000                | FINANCE                  |
| 80003            | Rajesh             | 29                | 59000                | BIGDATA                  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+

We will select data from Employee_old table for a given department and will insert that data into the respective partition of  Employee table. To insert data into the partitions of table Employee using a select query on the table Employee_old, use the following queries:-

#Overwrite data from result of a select query into the table partition
INSERT OVERWRITE TABLE Employee PARTITION (department='HR') 
SELECT id, name, age, salary from Employee_old where department='HR';

INSERT OVERWRITE TABLE Employee PARTITION (department='BIGDATA') 
SELECT id, name, age, salary from Employee_old where department='BIGDATA';

#Append data from result of a select query into the table partition
INSERT INTO TABLE Employee PARTITION (department='HR') 
SELECT id, name, age, salary from Employee_old; where department='HR';

INSERT INTO TABLE Employee PARTITION (department='BIGDATA') 
SELECT id, name, age, salary from Employee_old; where department='BIGDATA';

3. Directly insert values

We can directly insert rows into a Hive table partition. In this case Hive actually dumps the rows into a temporary file and then loads that file into the Hive table partition.

#Insert a single row in a table partition
INSERT INTO table Employee PARTITION (department='HR')
values(50000, 'Rakesh', 28, 57000);

#Insert Multiple rows in a table partition
INSERT INTO table Employee PARTITION (department='BIGDATA')
values(60001, 'Sudip', 34, 62000),(70001, 'Suresh', 45, 76000);

One Comment

  1. Pingback: Partitioning in Hive – My IT Learnings

Leave a Reply

Your email address will not be published. Required fields are marked *