We can use partitioning feature of Hive to divide a table into different partitions. Each partition of a table is associated with a particular value(s) of partition column(s). Partitioning allows Hive to run queries on a specific set of data in the table based on the value of partition column used in the query.
Without partitioning, any query on the table in Hive will read the entire data from the table.
If we have a large table then queries may take long time to execute on the whole table. We can make Hive to run query only on a specific partition by partitioning the table and running queries on specific partitions. A table can be partitioned on columns like – city, department, year, device etc.
Hive stores the data for each table partition in a separate directory, under the parent directory of table, on HDFS.
Query for a particular partition reads data from that partition only and therefore the queries on a set of partitions perform fast on partitioned tables. Choosing right columns to partition the table is the major task as this will greatly impact the query performance.
To know how to create partitioned tables in Hive, go through the following links:-
Hive provides two ways to create partitions in a partitioned table.
- Static Partitioning
- Dynamic Partitioning
Static partitioning is used when the values for partition columns are known when loading data into a Hive table. The data is assumed to be available partition-wise and then this data is loaded into their respective partitions. Following link shows how to actually load data into table partitions using static partitioning:-
Dynamic partitioning is used when the values for partition columns are not known when loading data into a Hive table.
Some cases where we need to use Dynamic partitioning to load data into hive Tables :-
- The values for partition columns are not known in advance. There may be cases where new partitions are getting added based on value of partition columns in the new data.
- The input data is not segregated partition-wise and the user may not want to segregate the input data per partition and then load it into each partition using static partitioning.
While inserting data using dynamic partitioning into a partitioned Hive table, the partition columns must be specified at the end in the ‘SELECT’ query. This is required for Hive to detect the values of partition columns from the data automatically. The order of partitioned columns should be the same as specified while creating the table.
Hive internally generates Map Reduce job to segregate data based on value of partition columns and then loads that data into their respective partitions.
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Following link shows how to actually load data into table partitions using dynamic partitioning:-
- One design of partitioning will be useful for some queries but may add overhead for other queries. The overhead may be in terms of maintaining and scanning multiple partition directories. So selecting right columns for partitioning is a very important task.
- Too many partitions will create too many directories under the table directory. This will create overhead on the namenode. Partition column should not have a high cardinality.