We will see how to create a Hive table partitioned by multiple columns and how to import data into the table.
Partitioning
We can use Hive's partitioning feature to divide a table into partitions. Each partition of a table is associated with particular values of the partition columns. Partitioning allows Hive to run a query against only the subset of the table's data selected by the partition-column values used in the query.
Without partitioning, any query on the table reads the table's entire data, so queries on a large table can take a long time to execute. By partitioning the table and restricting queries to specific partitions, we can make Hive scan only the relevant data. A table can be partitioned on columns such as city, department, year, or device.
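As a sketch of this idea, consider a hypothetical logs table partitioned by a year column (not part of this tutorial's example); a query that filters on the partition column lets Hive read only the matching partition's directory:

```sql
-- Hypothetical table: logs, partitioned by year.
-- Hive can read only the year=2023 sub-directory for this query,
-- instead of scanning the whole table.
SELECT * FROM logs WHERE year = 2023;
```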
Create Table
We will create an Employee table partitioned by state and department. The partition columns are declared in the PARTITIONED BY clause and must not be repeated in the regular column list; Hive still makes them available as columns in queries. When a table is partitioned by multiple columns, Hive creates nested sub-directories in the order of the partition columns.
CREATE TABLE Employee(
  ID BIGINT,
  NAME STRING,
  AGE INT,
  SALARY BIGINT
)
COMMENT 'This is Employee table stored as textfile partitioned by STATE and DEPARTMENT'
PARTITIONED BY(STATE STRING, DEPARTMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
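To confirm how Hive treats the partition columns, we can describe the table; for a partitioned table, Hive typically lists STATE and DEPARTMENT in a separate "# Partition Information" section of the output rather than among the ordinary data columns:

```sql
DESCRIBE Employee;
-- The FORMATTED variant prints additional table metadata,
-- including location and storage format:
DESCRIBE FORMATTED Employee;
```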
Load Data into the Table Partitions
Input Files :-
Suppose we have 2 states – Maharashtra and Karnataka – and 2 departments – HR and BIGDATA.
On our HDFS, we have records for employees from the Maharashtra state HR department in the file '/home/hadoop/mh_hr_employees.csv' and records for employees from the Maharashtra state BIGDATA department in the file '/home/hadoop/mh_bigdata_employees.csv'.
We have records for employees from the Karnataka state HR department in the file '/home/hadoop/kr_hr_employees.csv' and records for employees from the Karnataka state BIGDATA department in the file '/home/hadoop/kr_bigdata_employees.csv'.
Contents of '/home/hadoop/mh_hr_employees.csv' :-
1,aarti,28,30000
2,sakshi,22,20000
3,mahesh,25,25000
Contents of '/home/hadoop/mh_bigdata_employees.csv' :-
10001,rajesh,29,50000
10002,rahul,23,250000
10003,dinesh,35,70000
Contents of '/home/hadoop/kr_hr_employees.csv' :-
600001,shruti,28,31000
600002,purvi,26,27000
600003,manav,35,54000
Contents of '/home/hadoop/kr_bigdata_employees.csv' :-
70001,sujay,39,72000
70002,raman,43,850000
70003,ravi,28,60000
Load Data :-
Now, to import the employee data into their respective partitions in the Hive table, run the following queries.
-- Load employee data for the partition with state Maharashtra and department HR
LOAD DATA INPATH '/home/hadoop/mh_hr_employees.csv' INTO TABLE Employee PARTITION (state='Maharashtra', department='HR');

-- Load employee data for the partition with state Maharashtra and department BIGDATA
LOAD DATA INPATH '/home/hadoop/mh_bigdata_employees.csv' INTO TABLE Employee PARTITION (state='Maharashtra', department='BIGDATA');

-- Load employee data for the partition with state Karnataka and department HR
LOAD DATA INPATH '/home/hadoop/kr_hr_employees.csv' INTO TABLE Employee PARTITION (state='Karnataka', department='HR');

-- Load employee data for the partition with state Karnataka and department BIGDATA
LOAD DATA INPATH '/home/hadoop/kr_bigdata_employees.csv' INTO TABLE Employee PARTITION (state='Karnataka', department='BIGDATA');
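Note that LOAD DATA INPATH moves the files from their HDFS location into the table's partition directories. If the CSV files were on the local filesystem rather than HDFS, we could add the LOCAL keyword, which copies the file from the local machine instead:

```sql
-- Variant: load from the local filesystem (the file is copied, not moved)
LOAD DATA LOCAL INPATH '/home/hadoop/mh_hr_employees.csv'
INTO TABLE Employee PARTITION (state='Maharashtra', department='HR');
```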
Once the data is loaded into the table partitions, we can see that Hive has created two state directories under the Employee table's directory on HDFS (/user/hive/warehouse/employee), with a department sub-directory under each:
/user/hive/warehouse/employee/state=Maharashtra/department=HR
/user/hive/warehouse/employee/state=Maharashtra/department=BIGDATA
/user/hive/warehouse/employee/state=Karnataka/department=HR
/user/hive/warehouse/employee/state=Karnataka/department=BIGDATA
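We can also ask Hive to list the partitions it has registered for the table in the metastore:

```sql
SHOW PARTITIONS Employee;
-- This should list entries such as:
--   state=Karnataka/department=BIGDATA
--   state=Karnataka/department=HR
--   state=Maharashtra/department=BIGDATA
--   state=Maharashtra/department=HR
```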
Verify Data :-
select * from employee where state='Maharashtra' and department='HR';
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.state  | employee.department  |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| 1            | aarti          | 28            | 30000            | Maharashtra     | HR                   |
| 2            | sakshi         | 22            | 20000            | Maharashtra     | HR                   |
| 3            | mahesh         | 25            | 25000            | Maharashtra     | HR                   |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+

select * from employee where state='Maharashtra' and department='BIGDATA';
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.state  | employee.department  |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| 10001        | rajesh         | 29            | 50000            | Maharashtra     | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | Maharashtra     | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | Maharashtra     | BIGDATA              |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+

select * from employee where state='Maharashtra';
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.state  | employee.department  |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| 10001        | rajesh         | 29            | 50000            | Maharashtra     | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | Maharashtra     | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | Maharashtra     | BIGDATA              |
| 1            | aarti          | 28            | 30000            | Maharashtra     | HR                   |
| 2            | sakshi         | 22            | 20000            | Maharashtra     | HR                   |
| 3            | mahesh         | 25            | 25000            | Maharashtra     | HR                   |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+

select * from employee;
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.state  | employee.department  |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
| 70001        | sujay          | 39            | 72000            | Karnataka       | BIGDATA              |
| 70002        | raman          | 43            | 850000           | Karnataka       | BIGDATA              |
| 70003        | ravi           | 28            | 60000            | Karnataka       | BIGDATA              |
| 600001       | shruti         | 28            | 31000            | Karnataka       | HR                   |
| 600002       | purvi          | 26            | 27000            | Karnataka       | HR                   |
| 600003       | manav          | 35            | 54000            | Karnataka       | HR                   |
| 10001        | rajesh         | 29            | 50000            | Maharashtra     | BIGDATA              |
| 10002        | rahul          | 23            | 250000           | Maharashtra     | BIGDATA              |
| 10003        | dinesh         | 35            | 70000            | Maharashtra     | BIGDATA              |
| 1            | aarti          | 28            | 30000            | Maharashtra     | HR                   |
| 2            | sakshi         | 22            | 20000            | Maharashtra     | HR                   |
| 3            | mahesh         | 25            | 25000            | Maharashtra     | HR                   |
+--------------+----------------+---------------+------------------+-----------------+----------------------+--+
We can see that a query on a particular partition reads data from that partition only, so queries that touch a subset of partitions run faster on partitioned tables. In the query results, the partition columns and their values are appended at the end of each row.
If no partition columns are used in the query, all the partition directories are scanned and partitioning has no effect.
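One way to check whether a query will actually benefit from partition pruning is to inspect its plan with EXPLAIN; the plan should reference only the partitions matching the filter:

```sql
-- The query plan should show that only the
-- state=Maharashtra/department=HR partition is read.
EXPLAIN SELECT * FROM Employee
WHERE state = 'Maharashtra' AND department = 'HR';
```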
Choosing the right columns to partition the table on is an important decision, as it greatly impacts query performance.