Apache Hive is a data warehouse infrastructure for querying, analyzing and summarizing the data stored in Hadoop’s HDFS. It provides an SQL-like language called HiveQL with schema on read and implicitly converts queries to MapReduce, Tez or Spark jobs.
Some of the Hive features:-
- Different storage formats for data in Hive tables with support for compression.
- Table Partitioning.
- Indexing support.
- SQL like queries to process the data in tables. These queries are converted into MapReduce or Tez or Spark jobs.
- Rich set of built-in functions for string manipulation, data mining etc.
Hive allows a user to create tables, import data and then query that data using SQL like queries. Hive tables are created with a predefined schema. The schema is used to map the data in files with the table columns.
As Hive generates MapReduce Jobs to process a query, it will take some time for the results of the query to return. Hive should be used when the data to process is too large and the delay due to batch processing is acceptable. For processing real-time data to show real time results, Hive will show some lag.