Be careful with Hadoop and pbzip2!

Or you may be surprised at one day when you see that your output looks like it’s missing a lot of data. The problem affects Hadoop versions older than 1.4 (according to Jira) and is caused by the misinterpretation of EOS in compressed files, which is interpreted as EOF, so it – obviously – ends reading the file:

https://issues.apache.org/jira/browse/COMPRESS-146

So, if your Hadoop is misbehaving and your output data look odd without any reason – ask your admins if they didn’t change bzip2 to pbzip2.

Comments are closed.