How to Compress Data in Hadoop

Hadoop is awesome because it scales very well: you can keep adding data nodes without worrying about running out of space. Go nuts with the data! Pretty soon you will realize that’s not a sustainable strategy… at least not financially. It is important to have a storage/retention strategy: old data needs to be deleted or, if nothing else, compressed as much as possible.
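Before compressing anything, it helps to know where the space is actually going. A quick way to check (the path below is a placeholder, just like the ones in the command further down) is to ask HDFS for per-directory totals:

hadoop fs -du -s -h /user/your_user/path/to/large/directory
hadoop fs -count /user/your_user/path/to/large/directory

The first command prints the directory’s total size in human-readable units; the second adds file and directory counts, which helps spot directories full of tiny files.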

Here’s a simple way to compress a folder using the Snappy codec via Hadoop Streaming. With mapred.reduce.tasks set to 0 it runs as a map-only job, so the data is simply read and written straight back out, while mapred.output.compress and mapred.output.compression.codec tell the job to write Snappy-compressed output.

hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/your_user/path/to/large/directory \
  -output /user/your_user/path/to/compressed/directory
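Once the job finishes, the output directory should hold Snappy-compressed part files. A quick sanity check (same placeholder paths as above; the exact part-file names depend on the job, and reading Snappy files on the client assumes the Snappy native libraries are installed) is to list the directory and stream a file back through hadoop fs -text, which decompresses known codecs on the fly:

hadoop fs -ls /user/your_user/path/to/compressed/directory
hadoop fs -text /user/your_user/path/to/compressed/directory/part-* | head

If the contents look right, delete the original uncompressed directory with hadoop fs -rm -r to actually reclaim the space.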
 