Máirtín O Sullivan: December 2015

Wednesday, 30 December 2015

Book Review: Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Having never installed or played around with a Hadoop environment myself, I was on the lookout for an intro-style book that would give me the basics and enough info to get me started.

While browsing, this one caught my eye: I didn’t even realise there was a Hadoop 2, and the title was pretty much spot on for what I was looking for, so I decided to give it a shot.

Overall, I enjoyed the book and it was spot on for what I was looking for. It’s a traditional tutorial/walkthrough type of book on how to get a Hadoop cluster up and running and how to admin/interact with it, but it also covers enough theory that you don’t need any prior experience with Hadoop to follow along.

However, I think it’s overpriced in both the paper edition and the retail-price ebook, so if you’re interested in this book, try to read it on Safari or get the Kindle edition to make it affordable. Other than that, definitely recommended.

The book starts off with a really good overview of what Hadoop is, the MapReduce pattern and the changes in Hadoop 2. Good intro material.

The next chapter is a more traditional walkthrough on how to install Hadoop, using both the Hortonworks distribution and the Apache sources. It also covers using Ambari as a simple web-based admin console for your cluster. Nothing is explained in too much detail here as it’s covered later, but it’s a straightforward walkthrough and spot on for that.

The third chapter gives a really good intro to how HDFS works, covering the nodes involved, their roles, the approach taken to replication and then some basic file system commands. I particularly enjoyed this chapter as I hadn’t used HDFS before, so the concepts around the different nodes, compute following data, append-only files and block sizes were spot on for what I needed to understand.
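To make the block size and replication ideas a bit more concrete, here is a rough back-of-the-envelope sketch of my own (not taken from the book). It assumes a 128 MB block size and a replication factor of 3, which are the usual Hadoop 2 defaults for `dfs.blocksize` and `dfs.replication`, though your cluster may well be configured differently. The function name `hdfs_footprint` is just something I made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: typical Hadoop 2 default for dfs.blocksize
REPLICATION = 3      # assumption: typical default for dfs.replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate how one file is laid out in HDFS.

    Returns a tuple of (blocks, block_replicas, raw_storage_mb).
    This is illustrative arithmetic only, not an HDFS API call.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Each block is copied to `replication` different data nodes.
    block_replicas = blocks * replication
    # A partial last block only uses the space it needs, so raw storage
    # is simply the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

So under those assumptions a 1,000 MB file is split into 8 blocks, stored as 24 block replicas across the data nodes, and takes roughly 3,000 MB of raw disk.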

The fourth chapter covers running jobs and monitoring them in the web GUI, along with some examples for baselining the performance of the cluster.

The fifth and sixth chapters walk through the MapReduce approach to data analysis, using word counting in text files as the main example, and then move on to writing code to create MapReduce jobs, covering the basics in Java and Python. Simple and straightforward, but again spot on in terms of depth.
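For anyone who hasn’t seen the pattern before, here is a minimal single-process sketch of the word count idea in plain Python. This is my own illustration rather than the book’s code, and it doesn’t touch Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up, and a real job would spread the map and reduce steps across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["the quick fox", "the lazy dog", "The end"]
    print(word_count(sample))
```

The useful thing to notice is that the map and reduce functions only ever see one line or one word at a time, which is what lets a framework run many copies of them in parallel next to the data.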

The seventh chapter runs through some of the other Apache tools within the Hadoop ecosystem, covering Pig, Hive, Sqoop, Flume, Oozie and HBase. These are just quick overviews but interesting as I wasn’t aware of some of these.

The eighth chapter is really nice in that it focuses exclusively on YARN (Yet Another Resource Negotiator), which is new to Hadoop 2 and is one of the big differences in the new version. It walks through how to use YARN for things other than the traditional MapReduce pattern, using the YARN distributed shell as an example, before touching briefly on how some of the other Apache tools can be used with YARN.

The last two chapters focus on administering Hadoop through the required commands and the Ambari interface. I skimmed these as I’m only doing a very basic setup to get my head around Hadoop, but I would look back at them as needed.

In summary, the author notes at the outset that this book is written to a "hello world" level in terms of depth, and that’s spot on across the book. It gives you enough info to get you to a working example, and then it’s up to you. I really liked this analogy and it’s exactly the level I was looking for. I also liked the author’s style of writing, so I’ll be looking for more of his books to find some more advanced material on Hadoop.

If you’re looking for an intro to Hadoop that’s a nice combination of both theory and high-level tech implementation, then this is definitely worth a read.

One thing I would say is that I got through the book very quickly (3 hours roughly), and was surprised to see when I checked Amazon that the paper version is just over 300 pages, as it really didn’t feel like that. It reads more like a book of around 150 pages, which in my head makes sense for a quick-start book.

Why I highlight this is that while I really enjoyed the book, as I mentioned earlier, I don’t think it’s worth the $27 that the paper version is currently retailing for. For me it’s more in the $15 - $18 bracket, so if you’re going to read this then definitely try to go for the Kindle edition, which is worth it at $17.

Links:
Amazon: http://www.amazon.com/Hadoop-Quick-Start-Guide-Essentials-Addison-Wesley/dp/0134049942
Safari: https://www.safaribooksonline.com/library/view/hadoop-2-quick-start/9780134050119/

Monday, 28 December 2015

Book Review: Creating A Data-Driven Organization

I stumbled across this book while browsing and its title obviously jumped out at me, as I'm always interested in anything that helps quantify analysis or build data-driven approaches to what I do.

I wasn't entirely sure what to expect but in summary, it's a really enjoyable, easy read on how to build data-driven teams and the culture to support them in an organisation.

The book starts off by establishing what the author really means by data-driven, touching on some of the fundamentals of data quality, collection and analysis.

After these initial chapters the book really got interesting for me as it starts to look at the organisational and cultural considerations of building a data-driven program.

The author first outlines the different skill sets required for a rounded data-driven analysis team, covering business skills, programming, devops, stats, visualisation, machine learning and big data analysis. I really liked how the author shows these as complementary skills across the team, but highlights that your team members don't each need to be experts at all of them.

One really nice aspect is that the need for strong visualisation is highlighted immediately, specifically in relation to its role in not just performing data analysis, but selling it to the rest of the organisation. This is developed further later on in the book through a whole chapter on visualisation, including how it can/should be used effectively, covering a lot of the ideas from Tufte, etc. in a really nicely summarised form.

The author then moves on to describe the different types of data analysis and how they are used, and then works through some discussion around metrics and A/B testing as core examples of how data analysis can be applied to business contexts.

The next three chapters cover what I think is the most important aspect of the whole book: the approach to decision making and its effect on data-driven approaches, the key components of a data-driven culture within an organisation and the role of the C-suite in establishing this culture. These chapters outline many of the key cultural challenges in moving towards a more data-driven approach and are great reads for anyone who may be pushing for more data analysis within their organisation, but is struggling to get traction.

The book finishes out with a chapter on privacy, ethics and risk, which obviously as a security guy I love to see. I particularly like the “ick” factor approach that the author outlines to dealing with data analysis and privacy.

Overall I think this book is a great introduction to a lot of topics relating to data analysis and data-driven decision making, and incorporates some really good lessons on organisational structure, culture, skill sets and challenges with adopting data-driven approaches within organisations.

The author highlights throughout that this book doesn’t touch on the tools or technology used for data analysis, or details on data analysis approaches, as these are covered in many other books, which are referenced as needed. So if you're looking for this type of material, definitely go elsewhere.

However, if you’re new to applying data-driven approaches to your field (IT, business or otherwise) or if you’re a manager or leader looking to understand how you can effect change within your organisation towards a data-driven approach, I'd highly recommend this.

Links:
Amazon: http://www.amazon.com/Creating-Data-Driven-Organization-Carl-Anderson/dp/1491916915
Safari: https://www.safaribooksonline.com/library/view/creating-a-data-driven/9781491916902/