How to compress Data in Hadoop

Hadoop is awesome because it can scale very well. That means you can add new data nodes without having to worry about running out of space. Go nuts with the data! Pretty soon you will realize that’s not a sustainable strategy… at least not financially. It is important to have a storage / retention strategy. Old data needs to be deleted or if nothing else, compressed as much as possible.

Here’s a simple way to compress a folder using Snappy codec via Hadoop Streaming.

hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/your_user/path/to/large/directory \
  -output /user/your_user/path/to/compressed/directory
 
31
Kudos
 
31
Kudos

Now read this

Removing Database Level Locks in HIVE

Recently we started noticing “CREATE TABLE” taking incredibly long amounts of time to execute and after which they’d fail. A more detailed look in to the issue revealed that we had upgraded HIVE and the new version, which now allowed... Continue →