Note: This is a really old post, and no longer demonstrates how to verify compression. As mentioned by Edan in the comments below, compression can be verified by observing the differences between “Map Output Bytes” and the “Map Output Materialized Bytes” at the conclusion of a job.
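The counter comparison mentioned above can also be done programmatically once a job finishes. This is a hedged sketch, not from the original post: it assumes a completed `org.apache.hadoop.mapreduce.Job` handle named `job`, and uses the `TaskCounter` enum from later Hadoop releases.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// `job` is assumed to be a completed org.apache.hadoop.mapreduce.Job.
long rawBytes = job.getCounters()
        .findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
long materializedBytes = job.getCounters()
        .findCounter(TaskCounter.MAP_OUTPUT_MATERIALIZED_BYTES).getValue();

// With compression on, the bytes actually written out (materialized)
// should be noticeably smaller than the raw map output bytes.
boolean compressed = materializedBytes < rawBytes;
```

This fragment needs a running Hadoop job, so treat it as a sketch rather than something to paste verbatim.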
Recently, while setting up a Hadoop cluster, I wanted to verify that gzip compression was being used during the map output phase. I couldn’t find anything online on how to do this, so I came up with the following method. BTW, if anyone knows of a better way to go about this, let me know; this is less than slick. In Hadoop, intermediate (map output) compression is turned on in the following way:
JobConf conf = new JobConf(getConf(), myApp.class);
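On its own, that line only constructs the JobConf; in the old mapred API the compression itself is enabled with two further calls. A sketch under that assumption (myApp is the driver class from the line above):

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(getConf(), myApp.class);
// Compress the intermediate map output...
conf.setCompressMapOutput(true);
// ...and use gzip as the codec (DefaultCodec is used if none is set).
conf.setMapOutputCompressorClass(GzipCodec.class);
```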
I verified gzip compression was enabled in the map output stage in the following way:

1) In your conf/core-site.xml file (or hadoop-site.xml in versions before 0.20.0) you may have defined Hadoop’s temporary data directory; it is specified as the value of the hadoop.tmp.dir property. I don’t believe defining this directory is required, but if you haven’t defined it you’ll have to determine its default location.

2) While a job is executing, copy the contents of the Hadoop temporary folder to another temporary directory.

3) From that directory, navigate to mapred/local/taskTracker/jobcache/job_*/attempt_*_m_*/output/. You should see a file called file.out. If compression is enabled, this file should be compressed. I used hexedit on this file to verify that the gzip header was indeed present (a gzip stream starts with the bytes 1F 8B 08).

4) Additionally, the job.xml located at taskTracker/jobcache/job_*/attempt_*_m_* should have a field called mapred.compress.map.output, which will be set to true. Obviously, if you’ve already observed a compressed file, this field ought to be true; if the file was not compressed, you should see false for mapred.compress.map.output.
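The hexedit check from step 3 can be automated. A minimal sketch (the class and method names here are mine, not from the post): read the first three bytes of file.out and compare them to the gzip magic 1F 8B 08.

```java
import java.io.FileInputStream;
import java.io.IOException;

public class GzipHeaderCheck {
    // Returns true if the file begins with the gzip magic bytes 1F 8B
    // followed by 08 (the deflate compression method).
    public static boolean hasGzipHeader(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] magic = new byte[3];
            return in.read(magic) == 3
                    && (magic[0] & 0xFF) == 0x1F   // gzip magic byte 1
                    && (magic[1] & 0xFF) == 0x8B   // gzip magic byte 2
                    && (magic[2] & 0xFF) == 0x08;  // compression method: deflate
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. java GzipHeaderCheck .../output/file.out
        System.out.println(hasGzipHeader(args[0]) ? "gzip header present"
                                                  : "no gzip header");
    }
}
```

Pointing this at file.out (or at any file you suspect is gzip-compressed) saves opening a hex editor by hand.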