Sampling in Hive
Sampling: Sampling is the selection of a subset of data from a large dataset in order to run queries and verify results. The full dataset may be too large to query in its entirety. Therefore, in the development and testing phases, it is a good idea to run queries on…
Running Sampling Queries in Hive
We will see how to run sampling queries in Hive. Hive Table: We have the following table Employee in Hive, bucketed by ID into 5 buckets: CREATE TABLE Employee( ID BIGINT, NAME STRING, AGE INT, SALARY BIGINT, DEPARTMENT STRING ) COMMENT 'This is Employee table stored as textfile clustered by…
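As a rough illustration of the kind of query this post covers, the sketch below samples the Employee table described above with Hive's TABLESAMPLE clause. It assumes the table is bucketed by ID into 5 buckets, as stated. The choice of bucket 1 is arbitrary and not taken from the original post.

```sql
-- Bucket sampling on the bucketing column: because Employee is
-- clustered by ID into 5 buckets, Hive can read a single bucket file.
SELECT *
FROM Employee TABLESAMPLE(BUCKET 1 OUT OF 5 ON ID);

-- Sampling on rand() instead of a column: rows are assigned to the
-- sample at random, so the whole table is scanned.
SELECT *
FROM Employee TABLESAMPLE(BUCKET 1 OUT OF 5 ON rand());
```

Sampling on the bucketing column is usually the cheaper option, since only the matching bucket needs to be read.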
Complex data type in Hive: Map
Map – a complex data type in Hive that stores key-value pairs. Values in a map are accessed by their keys. Create Table: When creating a table with the Map data type, we need to specify the ‘COLLECTION ITEMS TERMINATED BY’ character, which separates the key-value pairs. ‘MAP KEYS…
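A minimal sketch of a table with a Map column follows. The table name `employee_skills`, its columns, and the delimiter characters are hypothetical choices for illustration, not taken from the original post.

```sql
-- Hypothetical table: each row carries a map of skill name to rating.
CREATE TABLE employee_skills (
  id     BIGINT,
  name   STRING,
  skills MAP<STRING, INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '#'   -- separates one key-value pair from the next
  MAP KEYS TERMINATED BY ':'           -- separates a key from its value
STORED AS TEXTFILE;

-- With these delimiters, an input line could look like:
--   1,Alice,hive:5#pig:3

-- Look up a value by its key; a missing key yields NULL.
SELECT name, skills['hive'] AS hive_rating
FROM employee_skills;
```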
Complex data type in Hive: Struct
Struct – a complex data type in Hive that stores a set of fields of different data types. The elements of a struct are accessed using dot notation. Create Table: When creating a table with the Struct data type, we need to specify the ‘COLLECTION ITEMS TERMINATED BY’ character. This…
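The dot notation mentioned above might look like the following sketch. The table `employee_address`, its field names, and the '#' delimiter are assumptions made for illustration.

```sql
-- Hypothetical table: address is a struct with two differently typed fields.
CREATE TABLE employee_address (
  id      BIGINT,
  name    STRING,
  address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '#'   -- separates the struct's fields
STORED AS TEXTFILE;

-- With these delimiters, an input line could look like:
--   1,Alice,Pune#411001

-- Struct fields are read with dot notation.
SELECT name, address.city, address.zip
FROM employee_address;
```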
Complex data type in Hive: Array
Array – a complex data type in Hive that stores an ordered collection of elements of the same type, accessible using a 0-based index. Create Table: When creating a table with the Array data type, we need to specify the ‘COLLECTION ITEMS TERMINATED BY’ character. This character will be used to specify different…
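A short sketch of 0-based array access, under the assumption of a hypothetical table `employee_languages` with '#' as the collection delimiter:

```sql
-- Hypothetical table: languages is an ordered list of strings.
CREATE TABLE employee_languages (
  id        BIGINT,
  name      STRING,
  languages ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '#'   -- separates the array elements
STORED AS TEXTFILE;

-- With these delimiters, an input line could look like:
--   1,Alice,Java#Scala#Python

-- Index 0 is the first element; size() returns the element count.
SELECT name,
       languages[0]    AS first_language,
       size(languages) AS language_count
FROM employee_languages;
```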
Basic Data types in Hive
We will see the basic data types in Hive.
Numeric Types
TINYINT: 1-byte signed integer. Values range from -128 to 127.
SMALLINT: 2-byte signed integer. Values range from -32,768 to 32,767.
INT: 4-byte signed integer. Values range from -2,147,483,648 to 2,147,483,647.
BIGINT: 8-byte signed integer. Values range from -9,223,372,036,854,775,808 to…
Partitioning vs Bucketing in Hive
We will see some of the differences between partitioning and bucketing in Hive. Partitioning: Partitioning divides a table into partitions. Each partition is stored as a separate directory. A partition is created for each unique value of the partition column. Hierarchical partitioning can be done by…
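To make the contrast concrete, here is a hedged sketch of the two table definitions side by side. The table names and the choice of DEPARTMENT as the partition column are assumptions for illustration only.

```sql
-- Partitioning: one directory per distinct department value.
-- The partition column is declared outside the regular column list.
CREATE TABLE employee_partitioned (
  id     BIGINT,
  name   STRING,
  salary BIGINT
)
PARTITIONED BY (department STRING);

-- Bucketing: a fixed number of buckets; each row is assigned to a
-- bucket by hashing the bucketing column. The column stays in the
-- regular column list.
CREATE TABLE employee_bucketed (
  id         BIGINT,
  name       STRING,
  salary     BIGINT,
  department STRING
)
CLUSTERED BY (id) INTO 5 BUCKETS;
```

With partitioning, the number of directories grows with the number of distinct values. With bucketing, the number of buckets is fixed when the table is created.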
Creating Bucketed and Sorted Table in Hive and Inserting Data
Create Table: A bucketed and sorted table stores its data in buckets, and the data in each bucket is sorted by the column specified in the SORTED BY clause when the table is created. To create a bucketed and sorted table, we need to use CLUSTERED BY (columns) SORTED…
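A minimal sketch of such a table and an insert into it is shown below. The table names `employee_sorted` and `employee_staging` are hypothetical, and sorting on ID in ascending order is an assumed choice.

```sql
-- Hypothetical bucketed and sorted table: 5 buckets on id,
-- rows within each bucket ordered by id.
CREATE TABLE employee_sorted (
  id         BIGINT,
  name       STRING,
  age        INT,
  salary     BIGINT,
  department STRING
)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 5 BUCKETS;

-- Assumption: on Hive releases before 2.0 the two settings below were
-- needed for inserts to honour the bucket and sort definition; later
-- releases enforce this automatically, so they are left commented out.
-- SET hive.enforce.bucketing = true;
-- SET hive.enforce.sorting = true;

-- Populate from a hypothetical staging table with the same columns.
INSERT OVERWRITE TABLE employee_sorted
SELECT id, name, age, salary, department
FROM employee_staging;
```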
Creating Bucketed Table in Hive and Inserting Data
Create Table: To create a bucketed table, we need to use the CLUSTERED BY clause to define the bucketing columns and provide the number of buckets. The following query creates a table Employee bucketed on the ID column into 5 buckets. CREATE TABLE Employee( ID BIGINT, NAME STRING, AGE INT, SALARY…
Bucketing in Hive
Bucketing: Bucketing is a method of evenly distributing data across many files. Multiple buckets are created, and each record is placed into one of them based on some logic, usually a hashing algorithm. The bucketing feature of Hive can be used to distribute/organize the table/partition data into multiple files such…
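The idea of hashing each record into a bucket can be approximated with Hive's built-in `hash` and `pmod` functions, as in the sketch below. It assumes an Employee table with an ID column and 5 buckets. It illustrates the principle only: the hash Hive uses internally to write bucket files depends on the Hive version and the table's bucketing version, so the numbers may not match the actual file assignment.

```sql
-- Illustrative only: hash the bucketing column, then take a
-- non-negative modulus by the number of buckets (5 here).
SELECT id,
       pmod(hash(id), 5) AS approx_bucket
FROM Employee;

-- Counting rows per computed bucket gives a rough view of how
-- evenly the records would be spread.
SELECT pmod(hash(id), 5) AS approx_bucket,
       count(*)          AS row_count
FROM Employee
GROUP BY pmod(hash(id), 5);
```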