Scenario 1: There was data, the logs say the namenode is not formatted, and the dfs.name.dir (check your config to see where it is) is empty
Cause: The data was emptied out of your namenode directory.
Things to try (in order):
- FSCK (see scenario 2 below)
- Recover the namenode
- hadoop namenode -recover
- If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
- Import the fsimage from a non-corrupt secondary namenode
- hadoop namenode -importCheckpoint
- If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
- Brute force it (sketched after this list)
- Find out in the config where the snn checkpoint is kept (fs.checkpoint.dir)
- SCP down ALL the files in the fs.checkpoint.dir to your local machine
- SCP up ALL the files you just downloaded to the dfs.name.dir
- For all those files chgrp to hadoop, chown to hdfs, chmod 755
- Start your HDFS service as usual through the cluster manager and think optimistic thoughts.
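A minimal sketch of the missing-directory fix and the brute-force copy, assuming dfs.name.dir is /data/dfs/nn, fs.checkpoint.dir is /data/dfs/snn, and the hosts are nn-host and snn-host. None of those values come from a real cluster, so read the real ones out of your config before running anything:
# On the namenode, if -recover or -importCheckpoint complains about a
# missing directory (assumed dfs.name.dir shown):
mkdir -p /data/dfs/nn
chgrp -R hadoop /data/dfs/nn
chown -R hdfs /data/dfs/nn
chmod -R 755 /data/dfs/nn

# Brute force, run from your local machine: pull the checkpoint files down
# from the secondary namenode, then push them up into dfs.name.dir.
mkdir -p ./snn-checkpoint
scp -r snn-host:/data/dfs/snn/* ./snn-checkpoint/
scp -r ./snn-checkpoint/* nn-host:/data/dfs/nn/

# Back on the namenode: repeat the ownership and permission fix on the
# copied files, then start HDFS from the cluster manager as usual.
chgrp -R hadoop /data/dfs/nn
chown -R hdfs /data/dfs/nn
chmod -R 755 /data/dfs/nn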
Scenario 2: There was data, the logs point to corrupt blocks
Cause: Probably a bad termination signal during a copy, or high-volume data movement over a bad network.
Things to try (in order):
- FSCK
You can use
hadoop fsck /
to determine which files are having problems. Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is really verbose, especially on a large HDFS filesystem, so I normally get down to the meaningful output with
hadoop fsck / | egrep -v '^\.+$' | grep -v eplica
which ignores lines with nothing but dots and lines talking about replication.
Once you find a file that is corrupt, run
hadoop fsck /path/to/corrupt/file -locations -blocks -files
Use that output to determine where the blocks might live. If the file is larger than your block size, it will span multiple blocks.
You can use the reported block numbers to go around to the datanodes and the namenode logs, searching for the machine or machines on which the blocks lived. Look for filesystem problems on those machines: missing mount points, a datanode that is not running, a filesystem that was reformatted or reprovisioned. If you can find a problem that way and bring the block back online, that file will be healthy again.
Lather, rinse, and repeat until all files are healthy or you have exhausted all alternatives looking for the blocks.
Once you determine what happened and you cannot recover any more blocks, just use the
hadoop fs -rm /path/to/file/with/permanently/missing/blocks
command to get your HDFS filesystem back to healthy so you can start tracking new errors as they occur.
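Those commands stitch together into a short triage pass. A minimal sketch, assuming the hadoop CLI is on your PATH and using placeholder values for the file path, the block id, and the datanode log directory (yours will differ):
# 1. Summarize filesystem health, hiding the dot lines and replication chatter.
hadoop fsck / | egrep -v '^\.+$' | grep -v eplica

# 2. For one suspect file (placeholder path), list its blocks and the
#    datanodes that are supposed to hold them.
hadoop fsck /path/to/corrupt/file -locations -blocks -files

# 3. On each machine that should hold a reported block (placeholder id),
#    grep the logs and check the disks before giving up on the block.
grep -r 'blk_1073741825' /var/log/hadoop/    # assumed log location

# 4. Only after you are sure the blocks are gone for good, remove the file
#    so fsck reports the filesystem healthy again.
# hadoop fs -rm /path/to/corrupt/file        # left commented out on purpose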
Scenario 3: Secondary Namenode can’t checkpoint the namenode
The SNN logs show that the checkpoint failed, probably with a missing txid=####. The steps below force an fsck (with automatic repair) of the root filesystem on the next boot; a scripted version is sketched at the end of this scenario.
- Change /etc/fstab and set the mount point to allow fsck on boot
- vi /etc/fstab as root
- Change the last zero in the first line to a one, so change:
LABEL=cloudimg-rootfs / ext4 defaults 0 0
to
LABEL=cloudimg-rootfs / ext4 defaults 0 1
- Save the file and exit
- Change FSCKFIX in /etc/default/rcS to yes
- vi /etc/default/rcS as root
- Find the line that says #FSCKFIX=no
- Change it to FSCKFIX=yes (make sure you remove the # comment character at the beginning)
- Save and exit
- Check and record the last FSCK run
- execute and record the output of
sudo tune2fs -l /dev/xvda1 | grep "Last checked"
- Reboot (use AWS instance reboot or do it from ssh)
- Check that FSCK ran on boot
- execute and verify that the date changed using
sudo tune2fs -l /dev/xvda1 | grep "Last checked"
- Reverse the changes you made in steps 1 and 2
- Reboot
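The steps above can also be scripted instead of edited by hand in vi. A minimal sketch, assuming an Ubuntu-style image where / is labeled cloudimg-rootfs, the root device is /dev/xvda1 (as in the tune2fs command above), and boot-time fsck behavior is controlled by /etc/default/rcS; run it as root and treat every path and device name as an assumption to verify first:
# Step 1: change the fsck pass number for / from 0 to 1 in /etc/fstab
# (the .bak copies make the changes easy to reverse afterwards).
sed -i.bak 's|^\(LABEL=cloudimg-rootfs\s\+/\s\+ext4\s\+defaults\s\+0\s\+\)0$|\11|' /etc/fstab

# Step 2: let fsck repair problems automatically at boot.
sed -i.bak 's|^#\?FSCKFIX=.*|FSCKFIX=yes|' /etc/default/rcS

# Step 3: record when fsck last ran, so you can confirm it ran after the reboot.
tune2fs -l /dev/xvda1 | grep "Last checked"

# Reboot, re-run the tune2fs line to verify the date changed, then restore
# both files from the .bak copies and reboot once more.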