Recently while setting up a Hadoop cluster, I wanted to verify that gzip compression was being used during the map output phase. I couldn’t find anything online on how to do this, so I came up with the following method. BTW, if anyone knows of a better way to go about this, let me know; this is less than slick.
In Hadoop, intermediate (map output) compression is turned on in the following way:

JobConf conf = new JobConf(getConf(), myApp.class);
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
I verified gzip compression was enabled in the map output stage in the following way:
1) In your conf/core-site.xml file (or hadoop-site.xml in versions earlier than 0.20.0) you may have defined the temporary data directory for Hadoop. This is specified as the value of the hadoop.tmp.dir property. I don’t believe you’re required to define this directory, but if you haven’t, you’ll have to determine its default location (in stock configurations it defaults to /tmp/hadoop-${user.name}).
2) While a job is executing, copy the contents of the Hadoop temporary directory to another temporary directory, so you can inspect the files after the job cleans them up.
3) From this directory, navigate to mapred/local/taskTracker/jobcache/job_*/attempt_*_m_*/output/. You should see a file called file.out. If compression is enabled, this file should be compressed. I used hexedit on this file to verify that the gzip header was indeed present (a gzip stream starts with the bytes 1F 8B 08).
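If you don’t have a hex editor handy, a few lines of plain Java (no Hadoop dependencies) can do the same magic-byte check. This is just a sketch: the class name is made up, and the main method gzips a sample buffer in memory to demonstrate the check — in practice you’d read the first few bytes of your copied file.out instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipHeaderCheck {
    // True if the buffer starts with the gzip magic bytes 1F 8B,
    // followed by the deflate compression-method byte 08.
    static boolean hasGzipHeader(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0x1F
                && (b[1] & 0xFF) == 0x8B
                && (b[2] & 0xFF) == 0x08;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for file.out: gzip some sample bytes in memory
        // and confirm the header check recognizes the result.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("map output".getBytes("UTF-8"));
        }
        System.out.println(hasGzipHeader(buf.toByteArray())); // prints true
    }
}
```

To check a real file, replace the in-memory buffer with the first three bytes of file.out (e.g. via FileInputStream).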
4) Additionally, the job.xml located at taskTracker/jobcache/job_*/attempt_*_m_* should have a field called mapred.compress.map.output set to true. Obviously, if you’ve already observed a compressed file, this field ought to be true; if the file was not compressed, you should see false for mapred.compress.map.output.
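Since job.xml uses Hadoop’s standard configuration layout — a &lt;configuration&gt; element containing &lt;property&gt; name/value pairs — you can also pull the flag out programmatically. The sketch below uses only the JDK’s built-in XML parser; the class name and the inline sample XML are illustrative, not taken from a real job.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class JobXmlCheck {
    // Returns the <value> of the <property> whose <name> matches, or null.
    static String getProperty(Document doc, String name) {
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String n = p.getElementsByTagName("name").item(0).getTextContent();
            if (name.equals(n)) {
                return p.getElementsByTagName("value").item(0).getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative stand-in for a real job.xml; in practice,
        // parse the job.xml file you copied out of the jobcache.
        String jobXml = "<configuration>"
                + "<property><name>mapred.compress.map.output</name>"
                + "<value>true</value></property>"
                + "</configuration>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(jobXml.getBytes("UTF-8")));
        System.out.println(getProperty(doc, "mapred.compress.map.output")); // prints true
    }
}
```

Of course, a quick grep for mapred.compress.map.output does the job just as well.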