Planning and Communicating Your Cluster Design

When you create a new Hadoop cluster on Amazon Web Services (AWS), putting together a configuration plan or topology can be overwhelming. Below is a Hadoop reference architecture template I’ve built that you can fill in. It addresses the key aspects of planning, building, configuring, and communicating your Hadoop cluster on AWS.

I’ve done this many times, and as part of my focus on tools and templates I thought I’d add a template you can use as a basic guideline for planning your Cloudera big data cluster. The template includes configurations for:

  • instance basics
  • instance list
  • storage
  • operating system
  • CDH version
  • the cluster topology
  • metastore detail for Hive, YARN, Hue, Impala, Sqoop, Oozie, and Cloudera Manager
  • high-availability
  • resource management
  • and additional detail for custom service descriptors (CSD) for Storm and Redis
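To make the sections above concrete, here is a minimal sketch of how such a plan could be captured in code before filling in the template. This is not the template itself: the class names (`Instance`, `ClusterPlan`), the field names, and every value (instance names, instance types, availability zone, OS, database choices) are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """One EC2 instance in the plan (all values used below are placeholders)."""
    name: str
    instance_type: str
    availability_zone: str
    roles: List[str] = field(default_factory=list)


@dataclass
class ClusterPlan:
    """Hypothetical container that mirrors the template's sections."""
    operating_system: str
    cdh_version: str
    instances: List[Instance] = field(default_factory=list)
    # Maps a service name to the database that backs its metastore.
    metastores: Dict[str, str] = field(default_factory=dict)
    high_availability: bool = False

    def roles_by_instance(self) -> Dict[str, List[str]]:
        """Return the topology as a mapping of instance name to roles."""
        return {i.name: list(i.roles) for i in self.instances}

    def instances_with_role(self, role: str) -> List[str]:
        """Return the names of all instances that carry the given role."""
        return [i.name for i in self.instances if role in i.roles]

    def validate(self) -> List[str]:
        """Run two basic sanity checks and return a list of problems found."""
        problems = []
        # HDFS high availability needs an active and a standby NameNode.
        if self.high_availability and len(self.instances_with_role("NameNode")) < 2:
            problems.append("HA is enabled but fewer than two NameNode instances are planned")
        if not self.instances_with_role("DataNode"):
            problems.append("no DataNode instances are planned")
        return problems


# Example plan; every value here is an illustrative placeholder.
plan = ClusterPlan(
    operating_system="CentOS",
    cdh_version="5.x",
    instances=[
        Instance("master-1", "m4.xlarge", "us-east-1a", ["NameNode", "ResourceManager"]),
        Instance("worker-1", "d2.xlarge", "us-east-1a", ["DataNode", "NodeManager"]),
    ],
    metastores={"hive": "mysql", "hue": "mysql"},
    high_availability=True,
)

for problem in plan.validate():
    print("WARNING:", problem)
```

Writing the plan down in a structured form like this makes it easy to spot gaps (here, a missing standby NameNode) before any instances are launched.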

No Warranty Expressed or Implied

It’s not meant to be exhaustive, as many items are not covered (AWS security groups, network optimization, dockerization, continuous integration, monitoring, etc.), but it is an example of a real-world cluster in AWS (instance and AZ details changed for security).

Screenshot of the roles and services in the big data design template
Example list of EC2 instances for the cluster plan

Cloudera Hadoop reference architecture configuration template for Amazon Web Services (AWS)

AWS_topology_template

Please feel free to let me know how it works for you and if you have any improvements for it.

The US National Institute of Standards and Technology (NIST) kicked off its Big Data Working Group on June 19, 2013. The sessions have now been broken down into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap. The charter for the working group:

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing a consensus definitions, taxonomies, reference architectures, and technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life.

Deliverables:

  • Develop Big Data Definitions
  • Develop Big Data Taxonomies
  • Develop Big Data Reference Architectures
  • Develop Big Data Technology Roadmap

Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.

Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.

Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300 – 1500 EDT (unless announced otherwise) by teleconference.

Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov

To help the US Government in this effort, sign up at http://bigdatawg.nist.gov/home.php