Jan 29

Recently while setting up a Hadoop cluster, I wanted to verify gzip compression was being used during the map output phase. I couldn’t find anything online on how to do this, so I came up with the following method. BTW, if anyone knows of a better way to go about this, let me know; this is less than slick.

In Hadoop, intermediate compression is turned on in the following way:

JobConf conf = new JobConf(getConf(), myApp.class);
...
conf.set("mapred.compress.map.output", "true")
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");

I verified gzip compression was actually being applied in the map output stage in the following way:

1) In your conf/core-site.xml file (or hadoop-site.xml in versions before 0.20.0) you may have defined Hadoop’s temporary data directory as the value of the hadoop.tmp.dir property. I don’t believe you’re required to set this property, but if you haven’t, you’ll need to determine its default location, /tmp/hadoop-${user.name} (one of the sketches after these steps prints the effective value).

2) While a job is executing, copy the contents of the Hadoop temporary folder to another temporary directory (the intermediate files are cleaned up once the job completes, so grab them mid-run).

3) From this directory navigate to mapred/local/taskTracker/jobcache/job_*/attempt_*_m_*/output/. You should see a file called file.out. If compression is enabled, this file should be compressed. I used hexedit on this file to verify that the gzip header was indeed present (a gzip header starts with the bytes 1F 8B 08). One of the sketches after these steps does this check programmatically.

4) Additionally, the job.xml located at taskTracker/jobcache/job_*/attempt_*_m_* should have a field called mapred.compress.map.output which will be set to true. Obviously, if you’ve already observed a compressed file, this field ought to be true; if the file was not compressed you should see false for the mapred.compress.map.output field. Another sketch below reads this value out of job.xml from code.
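
For step 1, an easy way to find the effective hadoop.tmp.dir is to ask a Configuration object directly; a minimal sketch (the class name is just for illustration), run with the cluster’s conf directory on the classpath:

import org.apache.hadoop.conf.Configuration;

public class PrintTmpDir {
    public static void main(String[] args) {
        // Picks up core-site.xml (or hadoop-site.xml on older versions) from the classpath
        Configuration conf = new Configuration();
        // Falls back to the built-in default /tmp/hadoop-${user.name} if unset
        System.out.println(conf.get("hadoop.tmp.dir"));
    }
}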
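
For step 3, if you’d rather not eyeball the bytes in hexedit, this small sketch (again, an illustrative class name) scans a copied file.out for the gzip magic bytes anywhere in the file:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FindGzipHeader {
    public static void main(String[] args) throws IOException {
        // args[0] is the path to a copied file.out
        int[] magic = {0x1F, 0x8B, 0x08};
        try (InputStream in = new FileInputStream(args[0])) {
            int b, matched = 0;
            long offset = 0;
            while ((b = in.read()) != -1) {
                // Track how far into the 1F 8B 08 sequence we have matched
                matched = (b == magic[matched]) ? matched + 1 : (b == magic[0] ? 1 : 0);
                if (matched == magic.length) {
                    System.out.println("gzip header found at offset " + (offset - 2));
                    return;
                }
                offset++;
            }
            System.out.println("no gzip header found");
        }
    }
}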
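
And for step 4, job.xml uses Hadoop’s standard configuration file format, so Configuration can parse it for you; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CheckJobXml {
    public static void main(String[] args) {
        // args[0] is the path to a copied job.xml; skip the built-in default resources
        Configuration conf = new Configuration(false);
        conf.addResource(new Path(args[0]));
        System.out.println("mapred.compress.map.output = "
                + conf.get("mapred.compress.map.output"));
    }
}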

3 Responses to “Verifying Intermediate Map Output Data is Compressed in Hadoop”

  1. Artem Ervits Says:

    You can check whether your code uses compression during the intermediate steps by going to the web UI: under the job config, find the mapred.compress.map.output property and check whether it is true. This way you don’t have to go to the folder and/or use a hex editor to make sure it is compressed.

  2. admin Says:

    @Artem – Thanks for the tip. However, this check of the Web UI just verifies that you’ve set compression to be used, not whether it actually is. Only by observing the intermediate data can you be certain.

  3. Edan Says:

    I was curious about the same thing. I found it easiest to compare “Map output bytes” to “Map output materialized bytes” … In my tests “Map output bytes” was the real size (10GB) and the “materialized bytes” was compressed (2GB).

    You can also check your reducer’s tasks logs, you should see a line like:

    org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor

    It’s still not as good perhaps as really verifying the data on disk, but it’s certainly better than just observing that the configuration is set, and easier than digging around in the job directories … a nice compromise I think :)
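
    A minimal sketch of the counter comparison Edan describes, using the old mapred API (the counter group and counter names here are assumptions and vary across Hadoop versions):

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.RunningJob;

    // conf is the JobConf from the snippet in the post
    RunningJob job = JobClient.runJob(conf); // blocks until the job completes
    Counters counters = job.getCounters();
    // Assumed group/counter names; these differ across Hadoop versions
    String group = "org.apache.hadoop.mapred.Task$Counter";
    long raw = counters.findCounter(group, "MAP_OUTPUT_BYTES").getCounter();
    long materialized = counters.findCounter(group, "MAP_OUTPUT_MATERIALIZED_BYTES").getCounter();
    System.out.println("raw: " + raw + " bytes, materialized: " + materialized + " bytes");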
