AWS configuration and topology template for Cloudera Hadoop

Planning and Communicating Your Cluster Design

When creating a new Amazon Web Services (AWS) Hadoop cluster, most people find it overwhelming to put together a configuration plan or topology.

I’ve done this many times, and as part of my focus on tools and templates I thought I’d add a template you can use as a basic guideline for planning your Cloudera big data cluster. The template includes configurations for:

  • instance basics
  • instance list
  • storage
  • operating system
  • CDH version
  • the cluster topology
  • metastore detail for Hive, YARN, Hue, Impala, Sqoop, Oozie, and Cloudera Manager
  • high-availability
  • resource management
  • and additional detail for custom service descriptors (CSD) for Storm and Redis

No Warranty Expressed or Implied

It’s not meant to be exhaustive, as many items are not covered (AWS security groups, network optimization, dockerization, continuous integration, monitors, etc.), but it is an example of a real-world cluster in AWS (instance and AZ details changed for security).

Screenshot of the roles and services in the big data design template
Example list of EC2 instances for the cluster plan

Cloudera Hadoop cluster configuration template for Amazon Web Services (AWS)


Please feel free to let me know how it works for you and if you have any improvements for it.

Double your effective IO on AWS EBS-backed volumes

Fresh Elastic Block Storage volumes have first-write overhead

At my employer I architect Big Data hybrid cloud platforms for a global audience that have to be FAST. In our cluster provisioning I find we frequently skip the initial write across our volumes that reduces write time during production compute workloads (called pre-warming the EBS volumes). Per Amazon, failure to pre-warm EBS volumes incurs a 5-50% loss in effective IOPS. Worst case, that means the IO portion of your HDFS writes could take DOUBLE the time until each sector has been touched by the kernel. Amazon asserts that this performance loss, amortized over the life of a disk, is inconsequential to most applications. For one of our current clusters we have, as a baseline, eight 1TB drives in each of 10 compute nodes. Our estimated pre-warm time is 30 hours on each mount point, so done sequentially that’s 80 drives at 30 hours each, or 2,400 hours to touch each drive block.

What does this imply? Without pre-warming we would have added as much as 2,400 additional hours of write latency during initial HDFS writes, and that latency could appear in many different places in the stack (HDFS direct writes, Hive PostgreSQL/MySQL metadata writes and management, log writes, etc.).
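To make the arithmetic above easy to redo for your own cluster, here is a small sketch. The function name prewarm_hours is made up for illustration, and the "in parallel" figure is an assumption (it supposes that concurrent dd runs do not slow each other down), not something measured here.

```shell
#!/bin/sh
# Total hours if every drive is warmed one after another.
# Arguments: node count, drives per node, hours per drive.
prewarm_hours() {
    echo $(( $1 * $2 * $3 ))
}

# Figures from the text: 10 nodes, 8 drives each, 30 hours per drive.
echo "fully sequential:       $(prewarm_hours 10 8 30) hours"
# Nodes warmed in parallel, drives on each node one at a time.
echo "sequential per node:    $(prewarm_hours 1 8 30) hours"
# Every drive at once; assumes parallel dd runs do not slow each other down.
echo "everything in parallel: $(prewarm_hours 1 1 30) hours"
```

Swap in your own node count, drive count, and per-drive estimate to see how much wall-clock time parallel warming could save.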

Steps to optimize your EBS writes

Read the AWS document above carefully: the first method in their article will ERASE EVERYTHING ON THE DISK. The steps below will execute this safely on disks with existing content.

To pre-warm the drives on your cluster:

  1. stop the cluster services
  2. ssh into each server
  3. execute lsblk and note the devices and their mount points (the devices likely start from /dev/xvdf and go down from there, increasing the letter at the end, such as /dev/xvdg, /dev/xvdh, etc.)
  4. unmount them one at a time with sudo umount /ONEMOUNTPOINT
  5. continue until all mount points are unmounted, meaning lsblk shows nothing after the ‘disk’ column for each drive
  6. execute the following for each drive, changing the if= and of= to the same mount point (the device):
       sudo dd if=/YOURMOUNTPOINT of=/YOURSAMEMOUNTPOINT conv=notrunc bs=1M
     Example:
       sudo dd if=/dev/xvdf of=/dev/xvdf conv=notrunc bs=1M
  7. Wait.  It’ll be a few minutes for a 32GB drive as shown in the Amazon write-up above or 1 day+ for a 1TB drive.
  8. After ALL the processes on the server complete, reboot the server


If you’d like to check on the process, or if your ssh session has expired and you want to ensure you’re still warming, execute ps aux | grep YOURMOUNTPOINT

Example:  ps aux | grep /dev/xvdf

A far better approach, of course, would be to automate this as part of your cluster deployment process using Chef or an equivalent infrastructure automation tool.
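As a starting point for that kind of automation, here is a minimal shell sketch of the manual steps. The function names (prewarm_cmd, is_unmounted, prewarm_all) and the DRY_RUN switch are made up for this example. It assumes the devices are already unmounted and that sudo will not prompt for a password. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch: pre-warm EBS devices by reading each block and writing it back in place.
# DRY_RUN defaults to 1, so nothing is executed until you opt in with DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Build the dd command from the manual steps for one device.
prewarm_cmd() {
    printf 'sudo dd if=%s of=%s conv=notrunc bs=1M\n' "$1" "$1"
}

# Succeeds only if lsblk lists no mount point for the device.
is_unmounted() {
    [ -z "$(lsblk -n -o MOUNTPOINT "$1" | tr -d '[:space:]')" ]
}

# Pre-warm every device given as an argument.
prewarm_all() {
    for dev in "$@"; do
        cmd=$(prewarm_cmd "$dev")
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: $cmd"
        elif is_unmounted "$dev"; then
            # One background dd per device, so the drives warm in parallel.
            $cmd &
        else
            echo "skipping $dev: still mounted" >&2
        fi
    done
    # Block until every background dd has finished.
    wait
}
```

After sourcing the file, prewarm_all /dev/xvdf /dev/xvdg prints the two dd commands; DRY_RUN=0 prewarm_all /dev/xvdf /dev/xvdg actually runs them. Running it under nohup or screen keeps it alive if your ssh session expires.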


How to recover a corrupt HDFS namenode

Scenario 1:  There was data, the logs say Namenode not formatted, the (check your config to see where it is) is empty

Cause:  The data was emptied out of your namenode directory.

Things to try (in order):

    1. FSCK (see scenario 2 below)
    2. recover the namenode
      1. hadoop namenode -recover
      2. If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
    3. Import the fsimage from a non-corrupt secondary namenode
      1. hadoop namenode -importCheckpoint
      2. If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
    4. Brute force it
      1. Find out in the config where the SNN checkpoint is kept (fs.checkpoint.dir)
      2. SCP down ALL the files in the fs.checkpoint.dir to your local machine
      3. SCP up ALL the files you just downloaded to the
      4. For all those files chgrp to hadoop, chown to hdfs, chmod 755
      5. Start your HDFS service as usual through the cluster manager and think optimistic thoughts.
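The "create them, chgrp to hadoop, chown to hdfs, chmod 755" fix shows up several times above, so a small helper can save typing. This is a sketch only: repair_dir_cmds is a made-up name, it assumes the hdfs user and hadoop group from the steps exist on the host, and it assumes the path contains no spaces. It prints the commands instead of running them so you can review them first.

```shell
#!/bin/sh
# Sketch: print the commands that recreate a missing directory with the
# ownership and mode described in the steps (owner hdfs, group hadoop, 755).
repair_dir_cmds() {
    dir="$1"
    printf 'mkdir -p %s\n' "$dir"
    # chown user:group covers both the chown and the chgrp from the steps.
    printf 'chown -R hdfs:hadoop %s\n' "$dir"
    printf 'chmod -R 755 %s\n' "$dir"
}
```

For example, repair_dir_cmds /data/dfs/nn prints three commands for a hypothetical directory /data/dfs/nn; once they look right, pipe the output to sudo sh to apply them.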

Scenario 2:  There was data, the logs point to corrupt blocks

Cause:  Probably a bad termination signal during a copy, or high-volume data movement over a bad network

Things to try (in order):

    1. FSCK
      You can use

        hadoop fsck /

      to determine which files are having problems. Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is really verbose especially on a large HDFS filesystem so I normally get down to the meaningful output with

        hadoop fsck / | egrep -v '^\.+$' | grep -v eplica

      which ignores lines with nothing but dots and lines talking about replication.

      Once you find a file that is corrupt, run

        hadoop fsck /path/to/corrupt/file -locations -blocks -files

      Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.

      You can use the reported block numbers to go around to the datanodes and the namenode logs searching for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines: missing mount points, datanode not running, file system reformatted/reprovisioned. If you can find a problem in that way and bring the block back online, that file will be healthy again.

      Lather, rinse, and repeat until all files are healthy or you exhaust all alternatives looking for the blocks.

      Once you determine what happened and you cannot recover any more blocks, just use the

        hadoop fs -rm /path/to/file/with/permanently/missing/blocks

      command to get your HDFS filesystem back to healthy so you can start tracking new errors as they occur.
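To turn the fsck output into a short list of files to chase, the filtering above can be wrapped in two small functions. This is a sketch: fsck_filter and fsck_bad_paths are made-up names, and the second one assumes fsck reports problem files on lines shaped like "/path/to/file: CORRUPT ..." or "/path/to/file: MISSING ...", which you should check against the output of your Hadoop version.

```shell
#!/bin/sh
# Drop the progress-dot lines and the replication chatter, as in the pipeline above.
fsck_filter() {
    grep -E -v '^\.+$' | grep -v eplica
}

# Print the unique paths that fsck flags as CORRUPT or MISSING.
# Assumes lines of the form "/path/to/file: CORRUPT ..." (verify on your version).
fsck_bad_paths() {
    fsck_filter \
        | grep -E '^/.*: (CORRUPT|MISSING)' \
        | sed -E 's/: (CORRUPT|MISSING).*$//' \
        | sort -u
}
```

Usage would be hadoop fsck / | fsck_bad_paths, after which each path can be fed to the -locations -blocks -files command shown earlier.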

Scenario 3: Secondary Namenode can’t checkpoint the namenode

The SNN logs show that the checkpoint failed, probably with a missing txid=####.

  1. Change /etc/fstab and set the mount point to allow fsck on boot
    1. vi /etc/fstab as root
    2. Change the last zero in the first line to one, so change
         LABEL=cloudimg-rootfs / ext4 defaults 0 0
       to
         LABEL=cloudimg-rootfs / ext4 defaults 0 1
    3. Save the file and exit
  2. Change FSCKFIX in /etc/default/rcS to yes
    1. vi /etc/default/rcS as root
    2. Find the line that says #FSCKFIX=no
    3. Change it to FSCKFIX=yes (make sure you remove the commenting # at the beginning)
    4. Save and exit
  3. Check and record the last FSCK run
    1. execute and record the output of
      sudo tune2fs -l /dev/xvda1 | grep "Last checked"
  4. Reboot (use AWS instance reboot or do it from ssh)
  5. Check that FSCK ran on boot
    1. execute and verify that the date changed using
      sudo tune2fs -l /dev/xvda1 | grep "Last checked"
  6. Reverse the changes you made in steps 1 and 2
  7. Reboot
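If you need to do this on more than one machine, the two file edits can be scripted. The sketch below assumes the same root entry label (LABEL=cloudimg-rootfs) and the same FSCKFIX line as in the steps. The function names are made up, and each call leaves a .bak copy next to the file it edits.

```shell
#!/bin/sh
# Set the fsck pass field (last column) of the root entry in an fstab-style file.
# Arguments: file path, new pass value (1 to check on boot, 0 to skip).
set_root_fsck_pass() {
    file="$1"; pass="$2"
    sed -E -i.bak "s|^(LABEL=cloudimg-rootfs[[:space:]]+/[[:space:]].*[[:space:]])[0-9]+[[:space:]]*\$|\\1${pass}|" "$file"
}

# Set FSCKFIX to yes or no, removing the leading # if the line is commented out.
# Arguments: file path, yes or no.
set_fsckfix() {
    file="$1"; value="$2"
    sed -E -i.bak "s|^#?FSCKFIX=.*\$|FSCKFIX=${value}|" "$file"
}
```

Run as root, set_root_fsck_pass /etc/fstab 1 and set_fsckfix /etc/default/rcS yes cover steps 1 and 2; the same calls with 0 and no reverse them for step 6. Note that the reversal leaves FSCKFIX=no uncommented rather than restoring the original commented line.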