How to Compress Data in Hadoop

Hadoop is awesome because it scales very well. You can keep adding data nodes, so it is tempting to stop worrying about running out of space. Go nuts with the data! Pretty soon, though, you will realize that’s not a sustainable strategy, at least not financially. It is important to have a storage and retention strategy: old data needs to be deleted or, if nothing else, compressed as much as possible.

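Here’s a simple way to compress a folder using the Snappy codec via Hadoop Streaming. The job runs with zero reducers, so it is map-only and each map task writes its output directly as a Snappy-compressed file.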

hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/your_user/path/to/large/directory \
  -output /user/your_user/path/to/compressed/directory
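
If you run this for more than one directory, it may help to wrap the command in a small function. The following is only a sketch under a few assumptions: `compress_dir` and the `DRY_RUN` variable are hypothetical names introduced here (they are not part of Hadoop), and the jar path has to be supplied by you since it depends on your installation. With `DRY_RUN=1` the function prints the command instead of submitting the job, which is handy for checking paths first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): wraps the streaming command above.
# Arguments: <streaming jar> <input dir> <output dir>
compress_dir() {
  local jar="$1" input="$2" output="$3"

  # Same options as the command above, collected in an array so that
  # paths containing spaces are passed through intact.
  local cmd=(hadoop jar "$jar"
    -Dmapred.output.compress=true
    -Dmapred.compress.map.output=true
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    -Dmapred.reduce.tasks=0
    -input "$input"
    -output "$output")

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Assumption: DRY_RUN is our own switch; print the command, do not run it.
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

For example, `DRY_RUN=1 compress_dir /path/to/hadoop-streaming.jar /user/your_user/path/to/large/directory /user/your_user/path/to/compressed/directory` prints the full command so you can review it before running it for real.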
 