Hive: How To Create A Table From CSV Files in AWS S3
Jul 12th, 2015

Say your CSV files are on Amazon S3 in the following directory:
s3://my-bucket/files/
The files can be plain text or gzipped text files:
$ aws s3 ls s3://my-bucket/files/
2015-07-06 00:37:06 0
2015-07-06 00:37:17 74796978 file_a.txt.gz
2015-07-06 00:37:20 84324787 file_b.txt.gz
2015-07-06 00:37:22 85376585 file_c.txt.gz
To create a Hive table on top of those files, you have to specify the structure of the files by giving column names and types. Since CSV fields are comma-separated, you also need to tell Hive the field separator (its default delimiter is the Ctrl-A character, not a comma):
CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/';
The Hive documentation lists all allowed column types.
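As an illustration, a table mixing several of those types might look like the following sketch (the extra column names here are made up for the example):

```sql
CREATE EXTERNAL TABLE posts (
  title         STRING,
  comment_count INT,
  score         DOUBLE,     -- floating-point column
  created_at    TIMESTAMP   -- e.g. 2015-07-06 00:37:06
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/';
```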
Excluding the first line of each CSV file
Most CSV files have a first line of headers; you can tell Hive to ignore it with TBLPROPERTIES:
CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/'
TBLPROPERTIES ("skip.header.line.count"="1");
Use a custom separator in CSV files
To specify a custom field separator, say |, for your existing CSV files:
CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/files/';
Flatten a nested directory structure
If your CSV files are in a nested directory structure, it requires a little bit of work to tell Hive to go through directories recursively. A simple solution is to programmatically copy all files into a single flat directory:
aws s3 ls --recursive s3://my-bucket/nested-structure/ \
| awk '{print $4}' \
| while read -r key; do
    aws s3 cp "s3://my-bucket/$key" "s3://my-bucket/flatten-structure/$(basename "$key")"
  done
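Alternatively, depending on your Hive and Hadoop versions, you can tell Hive to read subdirectories directly and avoid copying the data; the exact property names vary by version, so treat this as a sketch:

```sql
-- Allow Hive to descend into subdirectories of the table's LOCATION
-- (property names differ across Hive/Hadoop versions)
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

SELECT COUNT(*) FROM posts;
```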
Replace an existing table
If the table already exists, there will be an error when trying to create it:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table my_table already exists)
To delete a table if it exists:
DROP TABLE IF EXISTS my_table;
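Putting it together, a drop-and-recreate would look like the sketch below. Note that dropping an EXTERNAL table only removes the table metadata from the Hive metastore; the files on S3 are left untouched:

```sql
DROP TABLE IF EXISTS posts;

CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/';
```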