We will see some of the differences between partitioning and bucketing in Hive.
Partitioning
- Partitioning is used to divide the table into different partitions. Each partition is stored as a different directory.
- A partition is created for each unique value of the partition column.
- Hierarchical partitioning can be done by specifying the partitioning columns in the order of the hierarchy, for example Country, State, City. Each level becomes a nested subdirectory.
- We cannot control the number of partitions: if the partitioning columns have very high cardinality, the table ends up with a very large number of partitions.
- Partitioning allows Hive to avoid a full table scan when partition columns are used in the WHERE clause of a query. Such a query scans only the directories of the matching partitions.
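The directory-per-partition idea above can be sketched in a few lines of Python. This is only an illustrative model, not Hive code: the table directory `/warehouse/sales`, the sample rows, and the helper names `partition_path`, `build_layout`, and `prune` are all assumptions made for the example. It shows one `col=value` directory level per partition column and how a filter on partition columns narrows the set of directories that need to be read.

```python
from collections import defaultdict

# Hypothetical rows: (country, state, city, amount)
ROWS = [
    ("US", "CA", "San Francisco", 120),
    ("US", "CA", "Los Angeles", 80),
    ("US", "NY", "New York", 200),
    ("IN", "KA", "Bengaluru", 150),
]

# Partition columns listed in hierarchy order
PARTITION_COLUMNS = ("country", "state", "city")


def partition_path(table_dir, values):
    """Build a nested path with one col=value level per partition column.

    Simplified: real Hive also escapes special characters in values.
    """
    parts = [f"{col}={val}" for col, val in zip(PARTITION_COLUMNS, values)]
    return "/".join([table_dir] + parts)


def build_layout(table_dir, rows):
    """Group rows by partition directory: one directory per unique value combination."""
    layout = defaultdict(list)
    for country, state, city, amount in rows:
        layout[partition_path(table_dir, (country, state, city))].append(amount)
    return dict(layout)


def prune(layout, table_dir, **filters):
    """Keep only directories whose col=value segments match every filter."""
    wanted = {f"{col}={val}" for col, val in filters.items()}
    return {
        path: data
        for path, data in layout.items()
        if wanted <= set(path[len(table_dir) + 1:].split("/"))
    }


layout = build_layout("/warehouse/sales", ROWS)
# Equivalent in spirit to: WHERE country = 'US' AND state = 'CA'
scanned = prune(layout, "/warehouse/sales", country="US", state="CA")
for path in sorted(scanned):
    print(path)
```

With four distinct (country, state, city) combinations the sketch creates four directories, and the filter leaves only the two under `country=US/state=CA`. The number of directories grows with the number of distinct value combinations, which is why high-cardinality partition columns are a problem.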
Bucketing
- Bucketing is used to distribute/organize the data into a fixed number of buckets.
- Each bucket is stored as a file under the Table/Partition directory.
- The number of buckets is fixed at table creation time. Each record is assigned to a bucket based on the hash value of its bucketing columns.
- All records with the same value of the bucketing columns go into the same bucket.
- A bucket is not limited to a single value. The hash is mapped onto the fixed number of buckets (hash modulo the number of buckets), so records with different bucketing column values that map to the same bucket are stored together.
- Bucketing is used for efficient map-side joins between bucketed tables and for effectively executing sampling queries.
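The bucket assignment described above can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions: `toy_hash` is a stand-in and not Hive's actual hash function, and `NUM_BUCKETS`, `bucket_for`, and `distribute` are hypothetical names chosen for the example. The point is only that the bucket index is the hash taken modulo a fixed bucket count, so equal values always share a bucket and different values can collide into the same one.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # fixed when the table is created (assumed value for this sketch)


def toy_hash(value):
    """Deterministic stand-in hash; NOT Hive's real hash function."""
    if isinstance(value, int):
        return value
    h = 0
    for byte in str(value).encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Map a bucketing-column value to a bucket index in [0, num_buckets)."""
    return (toy_hash(value) & 0x7FFFFFFF) % num_buckets


def distribute(values, num_buckets=NUM_BUCKETS):
    """Assign each value to its bucket, preserving arrival order within a bucket."""
    buckets = defaultdict(list)
    for value in values:
        buckets[bucket_for(value, num_buckets)].append(value)
    return dict(buckets)


# Hypothetical user ids used as the bucketing column
buckets = distribute([1, 5, 9, 2, 6, 3, 1, 5])
for index in sorted(buckets):
    print(index, buckets[index])
```

In this run the ids 1, 5, and 9 are different values but all land in bucket 1, while both occurrences of id 1 end up together. Because two tables bucketed the same way on the join key place matching keys in the same bucket index, a join can pair up bucket files instead of shuffling whole tables, and a sample can be taken by reading only some of the buckets.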