We will compare the different storage formats available in Hive. The comparison is based on the size of the data on HDFS and the time taken to execute a simple query.
Cluster Summary
The performance was benchmarked on a 5-node Hadoop cluster. Each node is an 8-core machine with 8 GB of RAM and a 500 GB disk.
Data Summary
The data contains 30 columns and a total of 120 million rows. The total size of the uncompressed data in text format is 12.02 GB.
Comparison
FILE FORMAT | HDFS SIZE | MapReduce total cumulative CPU time for SELECT count(*) |
---|---|---|
TEXTFILE | 12.02 GB | 349 seconds |
ORC with NONE compression | 2.53 GB | 81 seconds |
ORC with ZLIB compression | 1.19 GB | 69 seconds |
ORC with SNAPPY compression | 1.63 GB | 75 seconds |
PARQUET | 1.73 GB | 116 seconds |
PARQUET with GZIP compression | 1.25 GB | 104 seconds |
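
To make the table easier to read at a glance, the figures can be expressed relative to the TEXTFILE baseline. The snippet below is only a small sketch of that arithmetic: the numbers are copied from the table above, while the names `RESULTS` and `relative_to_baseline` are made up for this illustration and are not part of Hive or of the original benchmark.

```python
# Figures copied from the comparison table above:
# (file format, HDFS size in GB, MapReduce total cumulative CPU time in seconds)
RESULTS = [
    ("TEXTFILE", 12.02, 349),
    ("ORC with NONE compression", 2.53, 81),
    ("ORC with ZLIB compression", 1.19, 69),
    ("ORC with SNAPPY compression", 1.63, 75),
    ("PARQUET", 1.73, 116),
    ("PARQUET with GZIP compression", 1.25, 104),
]


def relative_to_baseline(results, baseline="TEXTFILE"):
    """Return {format: (size_ratio, cpu_ratio)} relative to the baseline row.

    A size_ratio of 4.0 means the format is 4 times smaller on HDFS than the
    baseline; a cpu_ratio of 4.0 means the query used 4 times less CPU time.
    """
    base_size, base_cpu = next(
        (size, cpu) for name, size, cpu in results if name == baseline
    )
    return {
        name: (round(base_size / size, 2), round(base_cpu / cpu, 2))
        for name, size, cpu in results
    }


if __name__ == "__main__":
    for name, (size_ratio, cpu_ratio) in relative_to_baseline(RESULTS).items():
        print(f"{name:<32} {size_ratio:>6.2f}x smaller  {cpu_ratio:>5.2f}x less CPU time")
```

Read this way, ORC with ZLIB compression comes out roughly 10 times smaller than the text file and uses about 5 times less CPU time for the count query. These ratios describe this one dataset and this one query only.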