Running Sampling Queries in Hive

We will see how to run sampling queries in Hive.

Hive Table

We have the following table Employee in Hive, bucketed by ID into 5 buckets:-

CREATE TABLE Employee(
ID BIGINT,
NAME STRING, 
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING 
)
COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets'
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

We have the following data in the Employee table:-

select * from Employee;
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 5            | Rajesh         | 29            | 59000            | BIGDATA              |
| 10           | Sanjeev        | 51            | 99000            | FINANCE              |
| 11           | Sanjay         | 32            | 67000            | FINANCE              |
| 1            | Sudip          | 34            | 62000            | HR                   |
| 6            | Suman          | 37            | 63000            | HR                   |
| 7            | Paresh         | 42            | 71000            | BIGDATA              |
| 2            | Suresh         | 45            | 76000            | FINANCE              |
| 3            | Aarti          | 25            | 37000            | BIGDATA              |
| 8            | Rami           | 33            | 56000            | HR                   |
| 4            | Neha           | 27            | 39000            | FINANCE              |
| 9            | Arpit          | 41            | 46000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+

Sampling by Bucketing

We can run bucketing sampling queries on bucketed as well as non-bucketed tables. Also we can use bucketing columns as well as other columns for sampling.

Running Sampling queries on Table Bucketing column(s)

We will run our sampling queries in on ID(table is Bucketed on this column) column. These queries will use the already created buckets from the table. For a bucketed table we need not specify the column for sampling, in this case the columns used for bucketing the table will be used for sampling.

For non-bucketed tables specifying the column(s) for sampling is mandatory.

#By default, the column ID will be used for sampling if sampling column is not specified.
#The below query will fetch data from the 1st bucket
select * from Employee TABLESAMPLE(BUCKET 1 OUT OF 5);
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 5            | Rajesh         | 29            | 59000            | BIGDATA              |
| 10           | Sanjeev        | 51            | 99000            | FINANCE              |
+--------------+----------------+---------------+------------------+----------------------+--+

# We can specify the ID column for sampling.
#The following query will produce the same result as the above query
select * from Employee TABLESAMPLE(BUCKET 1 OUT OF 5 ON ID);
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 5            | Rajesh         | 29            | 59000            | BIGDATA              |
| 10           | Sanjeev        | 51            | 99000            | FINANCE              |
+--------------+----------------+---------------+------------------+----------------------+--+

#The below query will fetch data from the 2nd bucket
select * from Employee TABLESAMPLE(BUCKET 2 OUT OF 5);
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 11           | Sanjay         | 32            | 67000            | FINANCE              |
| 1            | Sudip          | 34            | 62000            | HR                   |
| 6            | Suman          | 37            | 63000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+

Running Sampling Queries on Table non-Bucketing column(s)

The columns which are not used for bucketing the table can also be used for sampling. In this case the query will scan the whole table and create the buckets as per the column(s) specified in the sampling query.

We will run our sampling query on AGE column.

select * from Employee TABLESAMPLE(BUCKET 2 OUT OF 5 ON AGE);
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 10           | Sanjeev        | 51            | 99000            | FINANCE              |
| 9            | Arpit          | 41            | 46000            | HR                   |
+--------------+----------------+---------------+------------------+----------------------+--+

Block Sampling

We can run block sampling queries on a hive table as follows:-

Using PERCENT

select * from Employee TABLESAMPLE(20 PERCENT);
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 5            | Rajesh         | 29            | 59000            | BIGDATA              |
| 10           | Sanjeev        | 51            | 99000            | FINANCE              |
+--------------+----------------+---------------+------------------+----------------------+--+

Using ROWS

select * from Employee TABLESAMPLE(2 ROWS);
+--------------+----------------+---------------+------------------+----------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  | employee.department  |
+--------------+----------------+---------------+------------------+----------------------+--+
| 5            | Rajesh         | 29            | 59000            | BIGDATA              |
| 10           | Sanjeev        | 51            | 99000            | FINANCE              |
+--------------+----------------+---------------+------------------+----------------------+--+

 

One Comment

  1. Pingback: Sampling in Hive – My IT Learnings

Leave a Reply

Your email address will not be published. Required fields are marked *