Wednesday 30 December 2015

Book Review: Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Having never installed or played around with a Hadoop environment myself, I was on the lookout for an intro-style book that would give me the basics and enough info to start me off.

While browsing, this one caught my eye, as I didn’t even realise there was a Hadoop 2. The title was pretty much spot on for what I was looking for, so I decided to give it a shot.

Overall, I enjoyed the book and it was spot on for what I was looking for. It’s a traditional tutorial/walkthrough type of book on how to get a Hadoop cluster up and running and how to admin/interact with it, but it also covers enough theory that you don’t need to have any prior experience with Hadoop to follow along.

However, I would say that I think it’s overpriced in both the paper edition and the retail-price ebook, so if you’re interested in this book, try and read it on Safari or get a Kindle edition to make it affordable. Other than that, definitely recommended.

The book starts off with a really good overview of what Hadoop is, the MapReduce pattern and the changes in Hadoop 2. Good intro material.

The next chapter is a more traditional walkthrough on how to install Hadoop, using both the Hortonworks distribution and the Apache sources. It also covers the use of Ambari as a simple web-based admin console for your cluster. Nothing too detailed is explained here as it’s covered later, but it’s a straightforward walkthrough so is spot on for that.

The third chapter gives a really good intro to how HDFS works, covering the nodes involved, their roles and the approach taken to replication, and then some basic file system commands. I particularly enjoyed this chapter as I hadn’t used HDFS before, so some of the concepts around the different nodes, compute following data, append-only files and block sizes were spot on for what I needed to understand.
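To make the block size and replication idea a bit more concrete, here is a rough back-of-the-envelope sketch of my own (not taken from the book). It assumes the stock Hadoop 2 defaults of a 128 MB block size and a replication factor of 3; the function name `hdfs_footprint` is just something I made up for illustration.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a single file is laid out in HDFS.

    Assumed defaults: 128 MB blocks and 3 replicas (the usual Hadoop 2
    settings; a real cluster may be configured differently).
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the data nodes.
    block_replicas = blocks * replication
    # A partial last block only occupies its actual size on disk,
    # so raw usage is simply the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

So a 1000 MB file under those assumed defaults ends up as 8 blocks, 24 block replicas spread around the cluster, and roughly 3000 MB of raw disk.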

The fourth chapter covers running jobs and monitoring them in the web GUI, along with some examples for baselining the performance of the cluster.

The fifth and sixth chapters walk through the MapReduce approach to data analysis, using word counting in text files as the main example, and then move on to the basics of writing code to create MapReduce jobs in Java and Python. Simple and straightforward, but again spot on in terms of depth.
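For anyone who hasn’t seen the word count example before, here is a minimal sketch of the idea in plain Python. This is my own illustration rather than the book’s code, and it runs entirely in one process with no Hadoop involved; the `mapper`, `shuffle`, `reducer` and `word_count` names are made up for the example, with `shuffle` standing in for the grouping that Hadoop does between the map and reduce phases.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, a local stand-in for Hadoop's shuffle/sort phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Reduce step: sum all the 1s emitted for a single word."""
    return word, sum(counts)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())

print(word_count(["the quick fox", "The lazy dog the end"]))
```

On a real cluster the map and reduce steps would run in parallel on different nodes, close to where the data blocks live, but the shape of the logic is the same.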

The seventh chapter runs through some of the other Apache tools within the Hadoop ecosystem, covering Pig, Hive, Sqoop, Flume, Oozie and HBase. These are just quick overviews but interesting as I wasn’t aware of some of these.

The eighth chapter is really nice in that it focuses exclusively on YARN (Yet Another Resource Negotiator), which is new to Hadoop 2 and is one of the big differences in the new version. It walks through how to use YARN for things other than the traditional MapReduce pattern, using the YARN distributed shell as an example, before touching briefly on how some of the other Apache tools can be used with YARN.

The last two chapters focus on administering Hadoop through the commands required and the Ambari interface. I skimmed these as I’m only doing a very basic setup to get my head around Hadoop, but I would look back to these as needed.

In summary, the author notes initially that this book is written to a "hello world" level in terms of depth and that’s spot on across the book. It gives you enough info to get you to a working example, and then it’s up to you. I really liked this analogy and it’s exactly the level I was looking for. I also liked the author’s style of writing, so I will be looking for more of his books to find some more advanced material on Hadoop.

If you’re looking for an intro to Hadoop that’s a nice combination of both theory and high-level tech implementation, then this is definitely worth a read.

One thing I would say is that I got through the book very quickly (3 hours roughly), and was surprised to see when I checked Amazon that the paper version is just over 300 pages, as it really didn’t feel like that. It reads more like a book of around 150 pages, which in my head makes sense for a quick-start book.

The reason I highlight this is that while I really enjoyed the book, as I mentioned earlier, I don’t think it’s worth the price of $27 that the paper version is currently retailing for. For me it’s more in the $15 - $18 bracket, so if you’re going to read this then definitely try and go for the Kindle edition, which is worth it at $17.

Links:
Amazon: http://www.amazon.com/Hadoop-Quick-Start-Guide-Essentials-Addison-Wesley/dp/0134049942
Safari: https://www.safaribooksonline.com/library/view/hadoop-2-quick-start/9780134050119/
