Streaming data to Hadoop using Unix Pipes? Use Pipefail

If you pipe the output of a command to Hadoop Streaming, you need to know about the Unix pipefail option. To see what it does, try this on your command line:

$> true | false
$> echo $?
1
$> false | true
$> echo $?
0

ZOMG WTF, why is that 0? The first command failed, so the exit status of the entire pipeline should be 1, no? By default, the return status of a pipeline is the return status of its last command. So if you have something like this:

$> mysql -u user -ppassword -e "SELECT * FROM sometable" | hadoop dfs -put - /somefile/on/the/cluster

The exit code will be 0 even if the mysql command fails. With the pipefail option set, the pipeline's return status becomes the exit code of the last command that failed, so it will be non-zero if any command in the pipeline fails.

$> set -o pipefail; mysql -u user -ppassword -e "SELECT * FROM sometable" | hadoop dfs -put - /somefile/on/the/cluster
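A quick sanity check: re-run the toy pipeline from above in an interactive bash shell with pipefail set. (Note that the status you get back is whatever the failing command returned, not always 1.)

$> set -o pipefail
$> false | true
$> echo $?     # now reports the failure from the first command in the pipeline
1

In a script you would typically put set -o pipefail near the top (often together with set -e), so a failed mysql export aborts the job instead of silently leaving a truncated file on the cluster.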
 
