Wednesday, 10 February 2016

Book Review: Hadoop Security

As a follow up to my basic into to Hadoop with Hadoop 2 Quick Start Guide, I wanted to get more detail on the security features available in the Hadoop ecosystem and this sounded like it fitted the bill and was recently published (June 2015) so figured it would be pretty up to date.

One thing that I immediately liked about the book is that apart from a very brief few pages of an intro to security concepts, it get straight into things, which for me is always a good indication taht there won't be much padding in the book.

The book first starts off with a section on security architecture starting with a basic look threat modelling for distributed systems, which is a nice touch as really threat modelling should be part of any security architecture discussion and even touching on this at a high level is great, as is puts the whole book in context.

The next chapter moves onto general security architectures in a Hadoop environment, covering network level segregation, OS level security and an overview of the different types of Hadoop node roles. This was a great start to the book as immediately it starts working through the different nodes, what user roles need access to them, what nodes can be segregated from direct access and how at a high level they interact for data loads and job submission.

The final chapter of the architecture section finishes up with an overview of Kerberos, which while initially seemed a bit strange, it becomes obvious why later on as Kerberos plays such a key role in Hadoop security. If you need to get up to speed quickly on Kerberos, I’d highly recommend Kerberos: A Network Authentication System… it’s a quick and easy read that I read over ten years ago and it’s still as good now as it was then.

The next section deep dives more into authentication and at this point the book gets straight into the hands on configuration guide, covering detailed configuration steps required to map Kerberos principles into the Hadoop world, how to map to local users, how user groups work in Hadoop and mapping to LDAP groups. The chapter then moves on to cover the various authentication protocols in use across the Hadoop ecosystem, before explaining the differences between simple and Kerberos authn and then a nice dive into token auth, including the flows of how delegation tokens are created to allow various systems to impersonate users. The chapter finishes off with a fully worked Kerberos authn configuration guide, which to be fair I skimmed over as I don’t need that level of detail at the moment.

The next chapter moves onto authz covering HDFS ACLs and extended ACLs and various service level authorisations before moving on to MapReduce (1 and 2) and YARN, and Zookeeper ACLs, HBase, and Oozie. There’s a few nice worked examples here of the effects of authz restrictions and what errors users will see when their access is restricted.

The book then moves on to cover Sentry, which is Apache’s attempt to centralise authz within the Hadoop ecosystem, which after reading through the previous few chapters it’s obvious it’s needed! The basic architecture on which Sentry works is covered and how it integrates with the various applications and then walks though how to configure each application to use Sentry. Again a very practical oriented approach is taken here with a lot of detail on the configuration steps.

The last chapter in this section covers the logging available by default in each of the various applications and their basic config. This is a quick chapter and really just goes to show the configuration aspect, rather than any analysis approaches to the logs.

The third section of the book moves onto data security, specifically to cover encryption of data in transit and at-rest, starting with great coverage of how HDFS file encryption works. What was particularly good in this chapter was the strong emphasis it places on the key management and also making the reader conscious of potential lack of encryption on temporary data such as logs. The second half of the chapter covers encryption of data in transit, mainly focusing on the configuration of SSL/TLS in the various applications in the ecosystem.

The next chapter is a short one and looks at security of data as it is loaded into the Hadoop ecosystem, covering both the confidentiality and integrity of the data, but mainly focusing on confidentiality/encryption. The following chapter then covers how client access of data in the Hadoop environment can be performed securely, focusing of course on the edge nodes and how users interact with them, through command line RPC or APIs. From an architecture perspective, I found this chapter particularly helpful as it does a good job of describing the trust boundary that will exist in most deployments and how this should be architected securely.

The last chapter in this section covers Cloudera Hue and to be honest I just skimmed this one as it wasn’t relevant to me.

The final section of the book covers some use cases nicely, outlining scenarios with business and security requirements, before walking through how to architect and configure the right mix of controls to meet the requirements. For me I would have loved more examples here as this is more at the level I’m working at, rather than the technical configuration. But still, great to see it presented in this way.

Overall, this was a great book that to be fair goes into a lot more depth in terms of technical configuration settings than I needed. This can make it a tough read if you’re just looking for the high level, however, if you’re setting up a Hadoop cluster then this should be your go-to book.

However, it also works great at the level I was looking for as it's got a strong focus on architecture considerations and puts the security functionality into context rather than just explaining the feature sets available. You just may need to skim some of the more detailed sections like I did!

Links:
Amazon: http://www.amazon.com/Hadoop-Security-Protecting-Your-Platform/dp/1491900989/
Safari: https://www.safaribooksonline.com/library/view/hadoop-security/9781491900970/


No comments:

Post a Comment