Be careful with Hadoop and pbzip2!

Otherwise you may be surprised one day to find that your output is missing a lot of data. pbzip2 writes a file as a series of concatenated bzip2 streams, each with its own end-of-stream (EOS) marker. According to Jira, Hadoop versions older than 1.4 misinterpret the first EOS as end-of-file (EOF) and, naturally, stop reading there:

https://issues.apache.org/jira/browse/COMPRESS-146

So if your Hadoop is misbehaving and your output data looks odd for no apparent reason, ask your admins whether they have replaced bzip2 with pbzip2.
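The failure mode can be reproduced outside Hadoop. The sketch below is only an analogy: it uses Python's standard bz2 module rather than Hadoop's own codec, and the sample records are made up. It joins two separately compressed chunks, which is the layout a parallel compressor like pbzip2 produces, then compares a reader that stops at the first end-of-stream marker with one that keeps going.

```python
import bz2

# Two independently compressed chunks, concatenated into one file body.
# Each chunk is a complete bzip2 stream with its own end-of-stream marker.
# (Hypothetical sample data, for illustration only.)
first = b"record 1\nrecord 2\n"
second = b"record 3\nrecord 4\n"
multi_stream = bz2.compress(first) + bz2.compress(second)

# A reader that treats the first end-of-stream marker as end-of-file:
# BZ2Decompressor decodes exactly one stream and then stops.
single = bz2.BZ2Decompressor()
truncated = single.decompress(multi_stream)

# A reader that continues across stream boundaries:
# bz2.decompress() handles concatenated streams (Python 3.3 and later).
complete = bz2.decompress(multi_stream)

print("single-stream reader:", len(truncated), "bytes")
print("multi-stream reader: ", len(complete), "bytes")
print("bytes left unread:   ", len(single.unused_data))
```

The single-stream reader reports success and simply returns less data, which is why the symptom shows up as silently missing output rather than as an error.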

Mounting an HDFS cluster as a local filesystem with hadoop-fuse

Using Hadoop can quickly become very annoying if you have to navigate the HDFS filesystem with the standard hadoop command. As a Linux user I am used to TAB autocompletion, which lets me move around my filesystem quickly and easily, so I was really disappointed by this difficulty. Luckily, there’s a solution that eased my pain!
