We will see how to create an external table in Hive and how to import data into the table.
External Table
External tables in Hive do not store the table's data in the Hive warehouse directory; only the table metadata is stored in the Hive metastore. Any directory on HDFS can be pointed to as the table's data location when creating an external table, and all files inside that directory are treated as table data.
When an external table is dropped, only the table metadata is deleted from the Hive metastore. The directory containing the data remains intact.
External tables allow a user to manage data outside of Hive.
Data on HDFS
We have a directory '/home/hadoop/employee_data/' on HDFS containing a data file employees_1.csv. We will treat this directory as the location for the external table Employee.
Contents of employees_1.csv:-
1,aarti,28,30000
2,sakshi,22,20000
3,mahesh,25,25000
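If the directory and file are not on HDFS yet, they can be placed there with the HDFS CLI. This is just a sketch; the local source path '/tmp/employees_1.csv' is an assumption:

hdfs dfs -mkdir -p /home/hadoop/employee_data                      # create the data directory on HDFS
hdfs dfs -put /tmp/employees_1.csv /home/hadoop/employee_data/     # local source path is assumed
hdfs dfs -ls /home/hadoop/employee_data                            # confirm the file is in place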
Create Table
We will create an external table Employee by using the directory '/home/hadoop/employee_data/' on HDFS as the data location. We need to use the EXTERNAL keyword while creating an external table :-
CREATE EXTERNAL TABLE Employee (
    ID BIGINT,
    NAME STRING,
    AGE INT,
    SALARY BIGINT
)
COMMENT 'This is External table Employee using textfile format'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/home/hadoop/employee_data';
Verify Data
Once the table is created, the data from the files in the HDFS directory can be queried as the Employee table's data.
select * from Employee;

+--------------+----------------+---------------+------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  |
+--------------+----------------+---------------+------------------+--+
| 1            | aarti          | 28            | 30000            |
| 2            | sakshi         | 22            | 20000            |
| 3            | mahesh         | 25            | 25000            |
+--------------+----------------+---------------+------------------+--+
Add Data to the Table by Dumping Files into HDFS Directory
We have a file '/home/hadoop/employees_2.csv' on HDFS. Now let us copy this file into the '/home/hadoop/employee_data/' directory, using the command shown after the file contents below.
Contents of employees_2.csv :-
10001,rajesh,29,50000
10002,rahul,23,25000
10003,dinesh,35,70000
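Copying the file is a plain HDFS operation; no Hive statement is involved. A minimal sketch, assuming the file already sits at '/home/hadoop/employees_2.csv' on HDFS as stated above:

hdfs dfs -cp /home/hadoop/employees_2.csv /home/hadoop/employee_data/   # drop the file into the table's data directory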
Let us see if the data from the newly dumped file is visible as the Employee table's data.
select * from Employee;

+--------------+----------------+---------------+------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  |
+--------------+----------------+---------------+------------------+--+
| 1            | aarti          | 28            | 30000            |
| 2            | sakshi         | 22            | 20000            |
| 3            | mahesh         | 25            | 25000            |
| 10001        | rajesh         | 29            | 50000            |
| 10002        | rahul          | 23            | 25000            |
| 10003        | dinesh         | 35            | 70000            |
+--------------+----------------+---------------+------------------+--+
Add Data to the Table by Using LOAD DATA Query
We can also add data to the table by using the LOAD DATA query. We have a file '/home/hadoop/employees_3.csv' on HDFS. We could copy this file directly into the '/home/hadoop/employee_data/' directory, or we can use the LOAD DATA query.
Contents of employees_3.csv :-
20001,john,38,36000
20002,sam,32,28000
20003,joe,35,29000
Load Data :-
LOAD DATA INPATH '/home/hadoop/employees_3.csv' INTO TABLE Employee;
We can see that the file employees_3.csv ends up in the '/home/hadoop/employee_data/' directory. Note that LOAD DATA INPATH moves the file from its original HDFS location into the table's directory rather than copying it.
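We can check this with the HDFS CLI; a sketch, assuming the same paths as above (the original location should no longer contain the file):

hdfs dfs -ls /home/hadoop/employee_data      # employees_3.csv now appears alongside the other files
hdfs dfs -ls /home/hadoop/employees_3.csv    # expected to report that the file no longer exists after the move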
Verify Data :-
select * from Employee;

+--------------+----------------+---------------+------------------+--+
| employee.id  | employee.name  | employee.age  | employee.salary  |
+--------------+----------------+---------------+------------------+--+
| 1            | aarti          | 28            | 30000            |
| 2            | sakshi         | 22            | 20000            |
| 3            | mahesh         | 25            | 25000            |
| 10001        | rajesh         | 29            | 50000            |
| 10002        | rahul          | 23            | 25000            |
| 10003        | dinesh         | 35            | 70000            |
| 20001        | john           | 38            | 36000            |
| 20002        | sam            | 32            | 28000            |
| 20003        | joe            | 35            | 29000            |
+--------------+----------------+---------------+------------------+--+
Drop Table
Dropping the external table only deletes the table metadata from the Hive metastore; it does not delete the data from the '/home/hadoop/employee_data/' directory. All the files inside this directory remain intact.
drop table Employee;
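After the drop, a quick listing of the data directory should still show all three files. This is a sketch using the same paths as above:

hdfs dfs -ls /home/hadoop/employee_data
# expected: employees_1.csv, employees_2.csv and employees_3.csv are still present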