Streaming data to Hadoop using Unix Pipes? Use Pipefail

If you pipe the output of a statement to hadoop streaming you must know about the unix pipefail option. To demonstrate what it does, try this out in your commandline:

$> true | false
$> echo $?
1
$> false | true
$> echo $?
0

ZOMG WTF why is that 0, the first command failed so the output of the entire command should be 1, no? By default, the return status of a pipeline is the return status of the last command. So if you have something like this:

$> mysql -u user -p password -e "Select * from sometable" | hadoop dfs -put - /somefile/on/the/cluster

The exit code will be 0 even if the mysql command fails. You can force the return status of the pipeline to be 1 if any command in the pipeline fails.

$> set -o pipefail;mysql -u user -p password -e "Select * from sometable" | hadoop dfs -put - /somefile/on/the/cluster
 
22
Kudos
 
22
Kudos

Now read this

Boilerplate Maven Pom for generating jars with dependencies

I find myself searching for this over and over again. Maven can be a pain in the butt and Gradle is supposed to be a huge improvement. But every now and then you have to work with Maven, here’s a useful gist which contains boilerplate to... Continue →