Creating a Partitioned Hive Table and Importing Data

We will see how to create a partitioned table in Hive and how to import data into the table.

Partitioning

We can use the partitioning feature of Hive to divide a table into multiple partitions. Each partition of a table is associated with a particular value (or combination of values) of one or more partition columns. Partitioning allows Hive to run a query against only the subset of data whose partition-column values match the filter used in the query.

Without partitioning, any query on the table in Hive reads the entire data set.
If the table is large, such queries may take a long time to execute. By partitioning the table and filtering on the partition column, we can make Hive run a query against specific partitions only. A table can be partitioned on columns such as city, department, year, or device.
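As a small sketch of the idea (the table and column names here are illustrative, not part of the example that follows):

```sql
-- Hypothetical table partitioned by year: the data for each year is stored
-- in its own directory on HDFS, e.g. .../sales/year=2023/
CREATE TABLE Sales(
ID BIGINT,
AMOUNT BIGINT
)
PARTITIONED BY(YEAR INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Because the query filters on the partition column, Hive reads only the
-- year=2023 directory instead of scanning the whole table.
SELECT SUM(amount) FROM Sales WHERE year = 2023;
```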

Create table

We will create an Employee table partitioned by department name:-

CREATE TABLE Employee(
ID BIGINT,
NAME STRING, 
AGE INT,
SALARY BIGINT 
)
COMMENT 'This is Employee table stored as textfile partitioned by DEPARTMENT'
PARTITIONED BY(DEPARTMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
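After creating the table, we can confirm the schema. In Hive, partition columns are listed separately from the regular columns (the exact output layout varies by Hive version):

```sql
-- DEPARTMENT should appear under a separate "# Partition Information" section
DESCRIBE Employee;
```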

Load data into the table partitions

Input Files :-

Suppose we have 2 departments – HR and BIGDATA.

On our HDFS, we have records for employees from the HR department in the file '/home/hadoop/hr_employees.csv' and records for employees from the BIGDATA department in the file '/home/hadoop/bigdata_employees.csv'.

Contents of '/home/hadoop/hr_employees.csv' :-

1,aarti,28,30000
2,sakshi,22,20000
3,mahesh,25,25000

Contents of '/home/hadoop/bigdata_employees.csv' :-

10001,rajesh,29,50000
10002,rahul,23,250000
10003,dinesh,35,70000

Load Data :-

Now, to import the employee data into their respective partitions in the Hive table, run the following statements.

-- Load employee data into the partition for department HR
LOAD DATA INPATH '/home/hadoop/hr_employees.csv' INTO TABLE Employee PARTITION (department='HR');

-- Load employee data into the partition for department BIGDATA
LOAD DATA INPATH '/home/hadoop/bigdata_employees.csv' INTO TABLE Employee PARTITION (department='BIGDATA');

Once the data is loaded into the table partitions, we can see that Hive has created two directories under the Employee table directory on HDFS – /user/hive/warehouse/employee. Note that LOAD DATA INPATH moves the source files from their original HDFS location into these partition directories rather than copying them.

Name                 Type
department=BIGDATA   dir
department=HR        dir
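We can also ask Hive to list the partitions it now knows about (the output below is how SHOW PARTITIONS typically renders; details may vary by version):

```sql
SHOW PARTITIONS Employee;
-- department=BIGDATA
-- department=HR
```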

Verify Data :-

-- Query on the partition having value 'HR'
select * from employee where department='HR';
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 1            | aarti          | 28            | 30000            | HR                   |
| 2            | sakshi         | 22            | 20000            | HR                   |
| 3            | mahesh         | 25            | 25000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+

-- Query on the partition having value 'BIGDATA'
select * from employee where department='BIGDATA';
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 10001        | rajesh         | 29            | 50000            | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | BIGDATA              |
+--------------+----------------+---------------+------------------+----------------------+--+

-- Query on the complete data
select * from employee;
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 10001        | rajesh         | 29            | 50000            | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | BIGDATA              |
| 1            | aarti          | 28            | 30000            | HR                   |
| 2            | sakshi         | 22            | 20000            | HR                   |
| 3            | mahesh         | 25            | 25000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+

We can see that a query filtering on the partition column reads data from the matching partitions only, which is why queries over a subset of partitions run faster on partitioned tables. In the query results, the partition column and its value appear as the last column.
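To confirm that Hive is pruning partitions, we can use EXPLAIN DEPENDENCY, which reports the tables and partitions a query would read (the output is JSON; the fragment in the comment below is abridged, and the exact format depends on the Hive version):

```sql
EXPLAIN DEPENDENCY select * from employee where department='HR';
-- input_partitions should list only the HR partition,
-- e.g. {"partitionName":"default@employee@department=HR"}
```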

Choosing the right columns to partition on is a major design decision, as it greatly impacts query performance.


