Hive is a great bigdata tool for the capability of translating SQL querying into a series of Map-Reduce tasks. In some cases, loading data into Hive becomes an issue to be solved. A tipical situation comes as followed: a large amount of structed data was generated by some process and was saved on HDFS, and now, you are finding a way to build database upon the mentioned data, and thus it could be handled by SQL querying. The problem is that, due to the big volume of data, the high cost of moving data from the birth place to Hive data directory could be ineluctable.
In the following parts of this post, a practical solution would be presented.
the Generation of Data
First of all, you'll have to dump your data onto HDFS with a specific structed format.
For example, I dumped GZip-ed, tab-sperating, 4-column data, generated by Spark Streaming tasks, to HDFS location: hdfs://namenode/path/to/data/<date>/<hour>/<dstreamid>
. Here, <date>
and <hour>
is the date and hour of generating, respectively, and <dstreamid>
is the ID of Direct Stream of Spark Streaming. As a consequence, the structure of folder shall be similar to that listed below:
1 | hdfs://namenode/path/to/data/2019-11-01/13/123456/_SUCCESS |
Creating Hive Table
Having data on HDFS folder, we are going to build a Hive table which is compatible with the format of data. For the reason that data moving while loading data into Hive table is not expected, an external table shall be created.
1 | CREATE EXTERNAL TABLE `table_name` ( |
Here:
- there are four fields named from
field_1
tofield_4
(you can modify them on demand); - there are three fields used for partition, and they are
date
/hour
/dstreamid
; normally, they should be consist to the sub-folder name when saving data onto HDFS; - field delimiter is
\t
, i.e. tab seperator; - the format of input is
org.apache.hadoop.mapred.TextInputFormat
; - the location of data of this partition is
hdfs://namenode/path/to/data/
, the full HDFS path of respective data; - the name of table is
table_name
, which could be modified on demand.
Loading Data into Hive Table
Having date generated and table created, we are well prepared to loading data into table. Here, ALTER TABLE
clause will be applied.
1 | ALTER TABLE |
This instructs Hive to:
- alter the table named
table_name
; - add a partition, if the specified partition is not exists currently;
- the location of data specified is
hdfs://namenode/path/to/data/2019-11-01/13/123456
.
After execution of the SQL, the HDFS folder is loaded as a partition of Hive external table, without data moving.