Streaming data to Hadoop using Unix Pipes? Use Pipefail

If you pipe the output of a statement to hadoop streaming you must know about the unix pipefail option. To demonstrate what it does, try this out in your commandline:

$> true | false
$> echo $?
1
$> false | true
$> echo $?
0

ZOMG WTF why is that 0, the first command failed so the output of the entire command should be 1, no? By default, the return status of a pipeline is the return status of the last command. So if you have something like this:

$> mysql -u user -p password -e "Select * from sometable" | hadoop dfs -put - /somefile/on/the/cluster

The exit code will be 0 even if the mysql command fails. You can force the return status of the pipeline to be 1 if any command in the pipeline fails.

$> set -o pipefail;mysql -u user -p password -e "Select * from sometable" | hadoop dfs -put - /somefile/on/the/cluster
 
18
Kudos
 
18
Kudos

Now read this

Hive doesn’t like the carriage return character

Have you ever run in to a situation where you count the number of rows for a table in a database, then dump it to CSV and then load it to HIVE only to find that number has changed? Well, you probably have carriage returns in your fields.... Continue →