In Hive queries, we can use Sort by, Order by, Cluster by, and Distribute by to manage the ordering and distribution of the output of a SELECT query. We will see this with an example.
We have a table Employee in Hive, partitioned by Department.
0: jdbc:hive2://localhost:10000> desc employee;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| id | bigint | |
| name | string | |
| age | int | |
| salary | bigint | |
| department | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| department | string | |
+--------------------------+-----------------------+-----------------------+--+
We have two Departments in the Employee table – ‘HR’ and BIGDATA’. Lets see the data in the Employee table per partition.
0: jdbc:hive2://localhost:10000> select * from employee where department='HR'; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 1 | aarti | 28 | 55000 | HR | | 2 | sakshi | 22 | 60000 | HR | | 3 | mahesh | 25 | 25000 | HR | +--------------+----------------+---------------+------------------+----------------------+--+ 0: jdbc:hive2://localhost:10000> select * from employee where department='BIGDATA'; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 10001 | rajesh | 29 | 50000 | BIGDATA | | 10002 | rahul | 23 | 250000 | BIGDATA | | 10003 | dinesh | 35 | 70000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+
SORT BY
Hive uses SORT BY to sort the rows based on the given columns per reducer. If there are more than one reducer, then the output per reducer will be sorted, but the order of total output is not guaranteed to be sorted.
If we set the number of reducers to 2, then the query using sort by on ‘salary‘ column will produce the following output:-
#set the number of reducers to 2 0: jdbc:hive2://localhost:10000> set mapred.reduce.tasks=2; 0: jdbc:hive2://localhost:10000> select * from employee sort by salary; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 10001 | rajesh | 29 | 50000 | BIGDATA | | 1 | aarti | 28 | 55000 | HR | | 2 | sakshi | 22 | 60000 | HR | | 10003 | dinesh | 35 | 70000 | BIGDATA | | 3 | mahesh | 25 | 25000 | HR | | 10002 | rahul | 23 | 250000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+
We can see that the rows in Red are sorted and rows in Blue are sorted among themselves. But the overall sorting order is not maintained. Both these set of rows were processed by different reducers. Sort by is used to sort the data in each reducer according to the user specified order.
ORDER BY
Order by guarantees the total ordering of the output. Even if there are multiple reducers, the overall order of the output is maintained.
The query using order by on ‘salary‘ column will produce the following output:-
#set the number of reducers to 2 0: jdbc:hive2://localhost:10000> set mapred.reduce.tasks=2; 0: jdbc:hive2://localhost:10000> select * from employee order by salary; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 3 | mahesh | 25 | 25000 | HR | | 10001 | rajesh | 29 | 50000 | BIGDATA | | 1 | aarti | 28 | 55000 | HR | | 2 | sakshi | 22 | 60000 | HR | | 10003 | dinesh | 35 | 70000 | BIGDATA | | 10002 | rahul | 23 | 250000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+
We can see that the overall order of sorting is maintained in the result by using Order By.
DISTRIBUTE BY
Distribute By is used to distribute the rows to different reducers based on the value(s) of column(s). All rows with the same Distribute By columns will go to the same reducer. This is like partitioning in Map-Reduce, where all the records having same value of partition goes to the same reducer. Distribute By does not guarantee clustering the rows, based on the distributed By columns, in the reducers.
#set the number of reducers to 2 0: jdbc:hive2://localhost:10000> set mapred.reduce.tasks=2; 0: jdbc:hive2://localhost:10000> select * from employee distribute by id; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 10002 | rahul | 23 | 250000 | BIGDATA | | 2 | sakshi | 22 | 60000 | HR | | 10001 | rajesh | 29 | 50000 | BIGDATA | | 10003 | dinesh | 35 | 70000 | BIGDATA | | 1 | aarti | 28 | 55000 | HR | | 3 | mahesh | 25 | 25000 | HR | +--------------+----------------+---------------+------------------+----------------------+--+
All rows in Red (with even id) were sent to one reducer and all rows in Blue (with Odd id)were went to second Reducer. Thus we can use distribute by to control the distribution of rows to different reducers.
Distribute by with Sort by
If we club Distribute by with Sort by, then we can control the clustering of rows inside each reducer based on the value of some columns.
#set the number of reducers to 2 0: jdbc:hive2://localhost:10000> set mapred.reduce.tasks=2; 0: jdbc:hive2://localhost:10000> select * from employee distribute by id sort by salary; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 2 | sakshi | 22 | 60000 | HR | | 10002 | rahul | 23 | 250000 | BIGDATA | | 3 | mahesh | 25 | 25000 | HR | | 10001 | rajesh | 29 | 50000 | BIGDATA | | 1 | aarti | 28 | 55000 | HR | | 10003 | dinesh | 35 | 70000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+
All rows in Red (with even id) were sent to one reducer and all rows in Blue (with Odd id)were went to second Reducer. Inside each reducer, the rows are sorted by salary.
CLUSTER BY
Cluster by is same as ‘Distribute by and Sort by’. But Cluster by does the distribution and sorting on same columns. If we want to distribute by some columns and then sort by some other columns, then we should use ‘ distribute by with sort by’ instead of ‘cluster by’. We will see what a query with ‘cluster by id‘ returns.
#set the number of reducers to 2 0: jdbc:hive2://localhost:10000> set mapred.reduce.tasks=2; 0: jdbc:hive2://localhost:10000> select * from employee cluster by id; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 2 | sakshi | 22 | 60000 | HR | | 10002 | rahul | 23 | 250000 | BIGDATA | | 1 | aarti | 28 | 55000 | HR | | 3 | mahesh | 25 | 25000 | HR | | 10001 | rajesh | 29 | 50000 | BIGDATA | | 10003 | dinesh | 35 | 70000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+
The above output is similar to ‘distribute by id sort by id‘ as shown below:-
#set the number of reducers to 2 0: jdbc:hive2://localhost:10000> set mapred.reduce.tasks=2; 0: jdbc:hive2://localhost:10000> select * from employee distribute by id sort by id; +--------------+----------------+---------------+------------------+----------------------+--+ | employee.id | employee.name | employee.age | employee.salary | employee.department | +--------------+----------------+---------------+------------------+----------------------+--+ | 2 | sakshi | 22 | 60000 | HR | | 10002 | rahul | 23 | 250000 | BIGDATA | | 1 | aarti | 28 | 55000 | HR | | 3 | mahesh | 25 | 25000 | HR | | 10001 | rajesh | 29 | 50000 | BIGDATA | | 10003 | dinesh | 35 | 70000 | BIGDATA | +--------------+----------------+---------------+------------------+----------------------+--+
Really good content with easy explanation!!
hi sir
your post clears my concept
please keep up the good work
very nice. please write on more topics related to Hadoop
Its awsome .. I have shared the URL to my friends too ..
too much satisfied by the way it is explained.. Thankyou so much
Excellent manner of explanation..Cleared all concepts..Thanks a ton 🙂