Recently, while setting up a Hadoop cluster, I wanted to verify that gzip compression was actually being used during the map output phase. I couldn’t find anything online on how to do this, so I worked out the following method. BTW, if anyone knows of a better way to go about this, let me know; this is less than slick.
In Hadoop, intermediate (map output) compression is turned on in the following way:
JobConf conf = new JobConf(getConf(), myApp.class);
...
conf.set("mapred.compress.map.output", "true");
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
I verified gzip compression was enabled in the map output stage in the following way:
1) In your conf/core-site.xml file (or hadoop-site.xml in versions before 0.20.0) you may have defined your temporary data directory for Hadoop. This is specified as the value of the hadoop.tmp.dir property. Defining this directory is optional, but if you haven’t defined it you’ll have to determine its default location (the stock default is /tmp/hadoop-${user.name}).
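2) While a job is executing, copy the contents of the Hadoop temporary folder to another temporary directory. The copy has to be made while the job is running, since the intermediate files only need to exist for the duration of the job.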
3) From this directory, navigate to mapred/local/taskTracker/jobcache/job_*/attempt_*_m_*/output/. You should see a file called file.out. If compression is enabled, this file should be compressed. I used hexedit on this file to verify that the gzip header was indeed present (the gzip header starts with 1F 8B 08).
4) Additionally, the job.xml located at taskTracker/jobcache/job_*/attempt_*_m_* should have a field called mapred.compress.map.output, which will be set to true. Obviously, if you’ve already observed a compressed file, this field ought to be set to true; if the file was not compressed, you should see false for that mapred.compress.map.output field.
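The header check in step 3 can also be scripted instead of eyeballed in a hex editor. The following is only a rough sketch using the Java standard library. The class name MapOutputGzipCheck and its two helper methods are made up for illustration and are not part of Hadoop. It assumes, as described above, that a compressed file.out begins with the gzip signature 1F 8B 08, and that you point it at the directory you copied in step 2.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: scans a copied Hadoop temp
// directory for map output files and reports whether each one starts
// with the gzip signature.
public class MapOutputGzipCheck {

    /** Returns true if the file begins with the bytes 1F 8B 08. */
    public static boolean hasGzipHeader(Path file) throws IOException {
        byte[] header = new byte[3];
        try (InputStream in = Files.newInputStream(file)) {
            int filled = 0;
            // read() may return fewer bytes than requested, so loop.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    return false; // file is shorter than three bytes
                }
                filled += n;
            }
        }
        return (header[0] & 0xFF) == 0x1F
            && (header[1] & 0xFF) == 0x8B
            && (header[2] & 0xFF) == 0x08;
    }

    /** Recursively collects every regular file named file.out under root. */
    public static List<Path> findMapOutputs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().equals("file.out"))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MapOutputGzipCheck <copied-hadoop-tmp-dir>");
            System.exit(1);
        }
        for (Path p : findMapOutputs(Paths.get(args[0]))) {
            String verdict = hasGzipHeader(p) ? "gzip header found" : "no gzip header";
            System.out.println(p + ": " + verdict);
        }
    }
}
```

Compile it with javac and run it as java MapOutputGzipCheck followed by the path of your copied directory. It prints one line per file.out it finds, so a job with several map tasks can be checked in one pass.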

January 10th, 2013 at 11:41 am
You can check whether your code uses compression during intermediate steps by going to the web UI: open the job config and check whether the mapred.compress.map.output property is set to true. This way you don’t have to go to the folder and/or use a hex editor to make sure it is compressed.
January 10th, 2013 at 11:53 am
@Artem – Thanks for the tip. However, this check of the Web UI just verifies that you’ve set compression to be used, not whether it actually is. Only by observing the intermediate data can you be certain.