We will see how to create a partitioned table in Hive and how to import data into the table.
Partitioning
We can use the partitioning feature of Hive to divide a table into partitions. Each partition of a table is associated with a particular value (or values) of the partition column(s). Partitioning allows Hive to run a query against only the subset of data whose partition-column values match the query.
Without partitioning, any query on the table in Hive reads the entire data set.
If the table is large, queries over the whole table may take a long time to execute. By partitioning the table, we can make Hive scan only the partitions relevant to a query. A table is typically partitioned on columns such as city, department, year, or device.
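As a sketch of the idea (using a hypothetical Sales table partitioned by year, not the Employee table created below), a query that filters on the partition column lets Hive skip every other partition:

```sql
-- Hypothetical table partitioned by year
CREATE TABLE Sales (item STRING, amount BIGINT)
PARTITIONED BY (year INT);

-- Hive scans only the year=2020 partition's directory for this query,
-- instead of reading the whole table
SELECT item, amount FROM Sales WHERE year = 2020;
```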
Create table
We will create an Employee table partitioned by department name:-
CREATE TABLE Employee(
  ID BIGINT,
  NAME STRING,
  AGE INT,
  SALARY BIGINT
)
COMMENT 'This is Employee table stored as textfile partitioned by DEPARTMENT'
PARTITIONED BY(DEPARTMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
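To confirm the partitioning scheme, we can inspect the table definition; the partition column is listed in its own section of the output rather than among the regular data columns:

```sql
-- DEPARTMENT appears under "# Partition Information",
-- separate from the regular columns (ID, NAME, AGE, SALARY)
DESCRIBE FORMATTED Employee;
```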
Load data into the table partitions
Input Files :-
Suppose we have 2 departments – HR and BIGDATA.
On our HDFS, we have records for employees from the HR department in the file '/home/hadoop/hr_employees.csv' and records for employees from the BIGDATA department in the file '/home/hadoop/bigdata_employees.csv'.
Contents of '/home/hadoop/hr_employees.csv' :-
1,aarti,28,30000
2,sakshi,22,20000
3,mahesh,25,25000
Contents of '/home/hadoop/bigdata_employees.csv' :-
10001,rajesh,29,50000
10002,rahul,23,250000
10003,dinesh,35,70000
Load Data :-
Now, to load the employee data into the respective partitions of the Hive table, run the following queries.

-- Load employee data into the partition having department as HR
LOAD DATA INPATH '/home/hadoop/hr_employees.csv'
INTO TABLE Employee
PARTITION (department='HR');

-- Load employee data into the partition having department as BIGDATA
LOAD DATA INPATH '/home/hadoop/bigdata_employees.csv'
INTO TABLE Employee
PARTITION (department='BIGDATA');
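Note that LOAD DATA INPATH moves the files from their HDFS location into the table's warehouse directory. If the CSV files were on the local filesystem instead of HDFS, adding the LOCAL keyword makes Hive copy them in:

```sql
-- LOCAL reads from the local filesystem and copies the file into the
-- warehouse, leaving the source file in place
LOAD DATA LOCAL INPATH '/home/hadoop/hr_employees.csv'
INTO TABLE Employee
PARTITION (department='HR');
```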
Once the data is loaded into the table partitions, we can see that Hive has created two directories under the Employee table directory on HDFS (/user/hive/warehouse/employee):

Name                  Type
department=BIGDATA    dir
department=HR         dir
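We can also list the table's partitions directly from Hive, without looking at HDFS:

```sql
-- Prints one row per partition, e.g. department=BIGDATA and department=HR
SHOW PARTITIONS Employee;
```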
Verify Data :-
-- Query on the partition having value 'HR'
select * from employee where department='HR';

+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 1            | aarti          | 28            | 30000            | HR                   |
| 2            | sakshi         | 22            | 20000            | HR                   |
| 3            | mahesh         | 25            | 25000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+

-- Query on the partition having value 'BIGDATA'
select * from employee where department='BIGDATA';

+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 10001        | rajesh         | 29            | 50000            | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | BIGDATA              |
+--------------+----------------+---------------+------------------+----------------------+--+

-- Query on the complete data
select * from employee;

+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 10001        | rajesh         | 29            | 50000            | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | BIGDATA              |
| 1            | aarti          | 28            | 30000            | HR                   |
| 2            | sakshi         | 22            | 20000            | HR                   |
| 3            | mahesh         | 25            | 25000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+
We can see that a query on a particular partition reads data from that partition only, which is why queries restricted to a set of partitions run faster on partitioned tables. In the query results, the partition column and its value appear as the last column.
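To check that Hive is actually pruning partitions rather than scanning the whole table, we can inspect the query plan (the exact plan output varies by Hive version and execution engine):

```sql
-- The table-scan section of the plan should reference only the
-- department=HR partition, not department=BIGDATA
EXPLAIN SELECT * FROM employee WHERE department = 'HR';
```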
Choosing the right columns to partition a table on is an important design decision, as it greatly impacts query performance.