How to Compress Data in Hadoop

Hadoop is awesome because it scales so well. You can keep adding data nodes and never worry about running out of space. Go nuts with the data! Pretty soon, though, you will realize that’s not a sustainable strategy, at least not financially. It is important to have a storage and retention strategy: old data needs to be deleted or, if nothing else, compressed as much as possible.

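Here’s a simple way to compress a folder using the Snappy codec via Hadoop Streaming. The job runs with zero reducers, so it is a map-only pass that reads the input directory and writes a compressed copy to the output directory.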

hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/your_user/path/to/large/directory \
  -output /user/your_user/path/to/compressed/directory
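
If you run this often, it may help to wrap the command in a small shell function. What follows is only a sketch: `compress_dir`, `STREAMING_JAR`, and `DRY_RUN` are names made up for illustration, not part of Hadoop. The jar path default is copied from the command above and will differ on your cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around the streaming command above.
# STREAMING_JAR, compress_dir and DRY_RUN are illustrative names, not Hadoop features.

# Default jar path copied from the command above; override it for your install.
STREAMING_JAR="${STREAMING_JAR:-/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-2.jar}"

# compress_dir INPUT OUTPUT [CODEC]
# Builds the same map-only streaming job; CODEC defaults to Snappy.
compress_dir() {
  input="$1"
  output="$2"
  codec="${3:-org.apache.hadoop.io.compress.SnappyCodec}"

  # Reuse the function's positional parameters as the argument list.
  set -- hadoop jar "$STREAMING_JAR" \
    -Dmapred.output.compress=true \
    -Dmapred.compress.map.output=true \
    -Dmapred.output.compression.codec="$codec" \
    -Dmapred.reduce.tasks=0 \
    -input "$input" \
    -output "$output"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command on one line instead of running it.
    echo "$*"
  else
    "$@"
  fi
}
```

Setting `DRY_RUN=1` before calling `compress_dir /user/your_user/path/to/large/directory /user/your_user/path/to/compressed/directory` prints the command without submitting the job, which is handy for checking the paths first.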
 