Verifying Intermediate Map Output Data is Compressed in Hadoop

Note: This is a really old post, and no longer demonstrates how to verify compression.  As mentioned by Edan in the comments below, compression can be verified by observing the differences between “Map Output Bytes” and the “Map Output Materialized Bytes” at the conclusion of a job.

If you’re looking for the actual intermediate compressed data on the local file system, Hadoop’s “mapred.local.dir” configuration value details that location.  The data, however, is temporary and is removed after a job completes.

Recently, while setting up a Hadoop cluster, I wanted to verify that gzip compression was actually being used during the map output phase. I couldn’t find anything online on how to do this, so I came up with the following method. BTW, if anyone knows of a better way to go about this, let me know; this is less than slick.

In Hadoop, intermediate (map output) compression is turned on in the following way:

JobConf conf = new JobConf(getConf(), myApp.class);
conf.set("mapred.compress.map.output", "true");
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
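The same settings can also be applied cluster-wide rather than per job. A sketch of the equivalent entry in mapred-site.xml (or hadoop-site.xml in versions before 0.20.0), using the same pre-0.21 property names:

```xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```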

I verified gzip compression was enabled in the map output stage in the following way:

1) In your conf/core-site.xml file (or hadoop-site.xml in versions before 0.20.0) you may have defined Hadoop’s temporary data directory. This is specified as the value of the hadoop.tmp.dir property. Defining this directory isn’t required, but if you haven’t defined it you’ll have to determine its default location.
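For reference, an entry defining that directory might look like the following (the path shown is just an example; in stock configurations hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>
```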

2) While a job is executing, copy the contents of the Hadoop temporary directory to another temporary directory (the data is removed once the job completes).

3) From this directory navigate to mapred/local/taskTracker/jobcache/job_*/attempt_*_m_*/output/. You should see a file called file.out. If compression is enabled, this file should be compressed. I used hexedit on this file to verify that the gzip header was indeed present (a gzip stream starts with the bytes 1F 8B 08).
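The magic-byte check can also be scripted instead of eyeballed in hexedit. Below is a minimal, self-contained sketch using only the plain JDK (no Hadoop dependencies); the class and method names are my own. The main method builds a small gzip stream in memory just to demonstrate the check — in practice you would read the first bytes of file.out:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {
    // A gzip stream begins with magic bytes 1F 8B followed by
    // compression method 08 (deflate).
    static boolean looksLikeGzip(byte[] b) {
        return b.length >= 3
            && (b[0] & 0xFF) == 0x1F
            && (b[1] & 0xFF) == 0x8B
            && (b[2] & 0xFF) == 0x08;
    }

    public static void main(String[] args) throws IOException {
        // Create a tiny gzip stream in memory to exercise the check;
        // replace this with the first bytes of file.out on disk.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("map output".getBytes("UTF-8"));
        }
        System.out.println(looksLikeGzip(buf.toByteArray())); // prints true
    }
}
```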

4) Additionally, the job.xml located at taskTracker/jobcache/job_*/attempt_*_m_* should have a field called mapred.compress.map.output which will be set to true. Obviously, if you’ve already observed a compressed file, this field ought to be set to true; if the file was not compressed you should see false for that field.

4 comments on “Verifying Intermediate Map Output Data is Compressed in Hadoop”

  1. You can check whether your code uses compression during the intermediate steps by going to the web UI: under the job config, find the mapred.compress.map.output property and check whether it is true. This way you don’t have to go to the folder and/or use a hex editor to make sure it is compressed.

  2. @Artem – Thanks for the tip. However, this check of the Web UI just verifies that you’ve set compression to be used, not whether it actually is. Only by observing the intermediate data can you be certain.

  3. I was curious about the same thing. I found it easiest to compare “Map output bytes” to “Map output materialized bytes” … In my tests “Map output bytes” was the real size (10GB) and the “materialized bytes” was compressed (2GB).

    You can also check your reducer task logs; you should see a line like: Got brand-new decompressor

    It’s perhaps still not as good as really verifying the data on disk, but it’s certainly better than just observing that the configuration is set, and easier than digging around in the job directories … a nice compromise I think 🙂
