Hive: How to Create a Table from CSV Files in AWS S3

Say your CSV files are on Amazon S3 in the following directory:

s3://my-bucket/files/

The files can be plain text or gzip-compressed text; Hive decompresses .gz files transparently based on the file extension:

$ aws s3 ls s3://my-bucket/files/
2015-07-06 00:37:06          0
2015-07-06 00:37:17   74796978 file_a.txt.gz
2015-07-06 00:37:20   84324787 file_b.txt.gz
2015-07-06 00:37:22   85376585 file_c.txt.gz

To create a Hive table on top of those files, you have to specify the structure of the files by giving column names and types. Note that Hive's default field delimiter is Ctrl-A (\001), not a comma, so you also have to declare the separator your files use:

CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/';
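
Hive now exposes the files as a table, and you can query them in place:

SELECT title, comment_count
FROM posts
ORDER BY comment_count DESC
LIMIT 10;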

The Hive documentation lists all of the allowed column types.
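
For example, here is a sketch of a wider table using several of the common primitive types (the extra columns are hypothetical, just to illustrate the syntax):

CREATE EXTERNAL TABLE posts_typed (
  title         STRING,
  comment_count INT,
  score         DOUBLE,
  is_pinned     BOOLEAN,
  created_at    TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/';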

Excluding the first line of each CSV file

Most CSV files start with a header line. You can tell Hive to skip it (the first line of every file under the location, not just the first file) with TBLPROPERTIES:

CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/'
TBLPROPERTIES ("skip.header.line.count"="1");

Use a custom separator in CSV files

To specify a custom field separator, say |, for your existing CSV files:

CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/files/';
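
Note that FIELDS TERMINATED BY does not handle separators that appear inside quoted fields. If your files contain quoted values, one option (Hive 0.14 and later) is the OpenCSVSerde; a sketch, keeping in mind that this SerDe reads every column as STRING:

CREATE EXTERNAL TABLE posts (title STRING, comment_count STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "|",
  "quoteChar"     = "\""
)
LOCATION 's3://my-bucket/files/';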

Flatten a nested directory structure

If your CSV files are in a nested directory structure, it takes a little work to make Hive read them, because Hive does not traverse subdirectories by default. A simple workaround is to programmatically copy all the files into a single flat directory (an alternative that keeps the files in place follows the script):

# List every object under the prefix, keep the key (4th column, skipping
# zero-byte directory-marker lines), and copy each file into a flat directory.
# Note: this breaks on keys that contain spaces.
aws s3 ls --recursive s3://my-bucket/nested-structure/ \
  | awk 'NF >= 4 {print $4}' \
  | while read -r key; do
      aws s3 cp "s3://my-bucket/$key" "s3://my-bucket/flatten-structure/$(basename "$key")"
    done
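
Alternatively, depending on your Hive and Hadoop versions, you may be able to avoid the copy entirely and have Hive traverse subdirectories itself; a sketch, assuming your execution engine honors these settings:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/nested-structure/';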

Replace an existing table

If the table already exists, trying to create it again fails with an error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table my_table already exists)

To delete a table if it exists:

DROP TABLE IF EXISTS my_table;
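
Note that because the table is EXTERNAL, DROP TABLE only removes the table definition from the metastore; the files in S3 are left untouched. A common pattern is therefore to drop and recreate the table in the same script:

DROP TABLE IF EXISTS posts;

CREATE EXTERNAL TABLE posts (title STRING, comment_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/files/';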
