Comparison of Storage Formats in Hive – TEXTFILE vs ORC vs PARQUET

We will compare the different storage formats available in Hive. The comparison is based on two measurements: the size of the data on HDFS and the MapReduce cumulative CPU time needed to run a simple SELECT count(*) query.

Cluster Summary

Performance was benchmarked on a 5-node Hadoop cluster. Each node is a machine with 8 cores, 8 GB of RAM and a 500 GB disk.

Data Summary

The data contains 30 columns and a total of 120 million rows. The total size of the uncompressed data in text format is 12.02 GB.

Comparison

FILE FORMAT                      HDFS SIZE    MapReduce total cumulative CPU time for SELECT count(*)
TEXTFILE                         12.02 GB     349 seconds
ORC with NONE compression        2.53 GB      81 seconds
ORC with ZLIB compression        1.19 GB      69 seconds
ORC with SNAPPY compression      1.63 GB      75 seconds
PARQUET                          1.73 GB      116 seconds
PARQUET with GZIP compression    1.25 GB      104 seconds
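
To make the numbers easier to compare, each format can be expressed relative to the TEXTFILE baseline. The following is a minimal Python sketch that only does arithmetic on the figures reported above; the names RESULTS and relative_to_textfile are illustrative and not part of Hive or of the original benchmark.

```python
# Sizes (GB) and cumulative CPU times (seconds) copied from the comparison above.
RESULTS = {
    "TEXTFILE": (12.02, 349),
    "ORC with NONE compression": (2.53, 81),
    "ORC with ZLIB compression": (1.19, 69),
    "ORC with SNAPPY compression": (1.63, 75),
    "PARQUET": (1.73, 116),
    "PARQUET with GZIP compression": (1.25, 104),
}


def relative_to_textfile(results):
    """Return {format: (size_ratio, cpu_ratio)} using TEXTFILE as the baseline.

    A size_ratio of 10 means the format is 10 times smaller on HDFS than
    TEXTFILE; a cpu_ratio of 5 means it used 5 times less cumulative CPU time.
    """
    base_size, base_cpu = results["TEXTFILE"]
    return {
        name: (base_size / size, base_cpu / cpu)
        for name, (size, cpu) in results.items()
    }


if __name__ == "__main__":
    for name, (size_ratio, cpu_ratio) in relative_to_textfile(RESULTS).items():
        print(f"{name:32s} {size_ratio:5.1f}x smaller  {cpu_ratio:4.1f}x less CPU time")
```

On these measurements, ORC with ZLIB compression is roughly 10 times smaller than the text data and needs about 5 times less CPU time for the count query. Both Parquet variants are about 7 to 10 times smaller than the text data, but they use more CPU time than any of the ORC variants.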

 

 
