January - March 2012 Log
Dealt with various Hadoop issues (especially the fuse mount of /hadoop); R510s TCP tuned and channel bonded; compute nodes TCP tuned; SE-0-1 DIMM replaced; one R510 node motherboard and two hard drives replaced. Firmware updated on the R510 nodes to account for a spurious HV error in omreport system.
March 31, 2012
MT
- Overnight a user wrote more than 200 GB to /home. This brought down most services on the cluster, breaking it for everyone. Had to remove a large chunk of that user's recent simulation files to restore cluster function. Users are reminded that "df -h" shows how much space is available on a drive, and that they should write no more than 25 GB to their /home directory. "du -sh directoryname" shows how much space a directory is taking up, so one can run a small process and use that to estimate how it would scale up to a full analysis production.
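- A minimal sketch of the two checks (the username and directory names below are placeholders):
      df -h /home                        # free space on the filesystem holding /home
      du -sh /home/username/testrun      # total size of one directory, e.g. a small test run
      du -sk /home/username/* | sort -n  # per-directory sizes in kB, largest last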
March 16, 2012
JT
- Saw that /data was not accessible from any node. Attempts to list the contents of /data resulted in "Stale NFS file handle" error. Reboot of grid node fixed the problem. No obvious cause found in examining the logs.
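- Worth trying on an affected client before a full reboot next time, assuming /data is a plain NFS mount defined in /etc/fstab (sketch only):
      umount -l /data   # lazy unmount drops the stale handle
      mount /data       # remount using the /etc/fstab entry
      ls /data          # verify the listing works again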
March 6, 2012
MT
- Dell suggested that the problems with R510-0-2 were a spurious voltage chassis error associated with running old firmware. Performed the firmware update per our Hardware page; still had voltage errors. Performed a BIOS reset (power off; unplug; hold down power button for 3 minutes; plug in; power on). R510-0-2 no longer has voltage chassis errors. Opened the node back up to condor; will consider doing the other nodes one at a time (to minimize Hadoop problems) in the near future. (Later did the other nodes.)
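- For reference, the kind of OpenManage checks that confirm the voltage status after a firmware update / BIOS reset (run on the node itself; sketch, not necessarily the exact commands used here):
      omreport chassis volts     # per-probe voltage readings and status
      omreport system alertlog   # hardware alert history
      omreport system version    # confirm the BIOS/firmware versions picked up the update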
March 2, 2012
MT
- Added omreport system alertlog monitoring to the regular weekly checks. Found that R510-0-2 has been showing a voltage error since Jan. 18, and it presents like the same failed-motherboard error that R510-0-5 (now 17) had. Putting in a support call to Dell and removing it from Condor. Expect only ~2 hours of downtime for that node when we get the part and a tech for the repair. Checked that the node is powered properly in the cluster room. During the check, had some KVM response problems from R510-0-17, which was fine upon reboot (all cables connected properly).
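- A sketch of the sort of alertlog sweep now in the weekly checks ("r510" as a host group is an assumption; adjust to however the nodes are named in Rocks, and the actual check script may differ):
      rocks run host r510 command='omreport system alertlog' | grep -i -B1 -A4 'critical\|non-recoverable'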
February 28, 2012
MT
- Noted that the compute and R510 nodes each had 4-6 instances of the fuse /hadoop mount. No idea why; not sure if it's a problem. Over the weekend, lost the /hadoop fuse mount on two separate compute nodes. Implemented TCP tuning on the compute nodes and selectively rebooted them (while jobs were not running, and one at a time so as not to affect hadoop replication).
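- For reference, the style of TCP tuning applied: entries in /etc/sysctl.conf followed by "sysctl -p". The numbers below are illustrative only, not necessarily the exact values we used:
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 87380 16777216
      net.ipv4.tcp_wmem = 4096 65536 16777216
      net.core.netdev_max_backlog = 30000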
February 27, 2012
JT & MT
- R510-0-17: replaced two failed hard drives; hadoop running OK; node back on condor. Now the full ~86 TB is available to Hadoop. Calculated that if one R510 goes down we would lose 12 TB (with replication), so keep the goal of 90% or less usage in the /hadoop volume.
- JT: implemented cron jobs to monitor /home drive space, /var, and R510 & SE health (see the monitoring sketch at the end of this entry)
- /hadoop lost fuse mount from a couple compute nodes over the weekend, saw new java error in Hadoop logs which will need understanding and debugging:
      2012-02-24 22:19:05,471 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.255.241:50010, storageID=DS-626793967-10.1.255.241-50010-1313696050264, infoPort=50075, ipcPort=50020):DataXceiver
      java.io.IOException: xceiverCount 262 exceeds the limit of concurrent xcievers 256
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
- Note that this particular error correlates with a set of jobs that apparently overwhelmed all nodes (R510 and compute) with many file I/O requests (heavy LAN network usage correlates with the troubles). /hadoop needed to be remounted multiple times while those jobs were running, and /scratch on many R510s went to 100% with hadoop logs filling with errors. Cleaned /scratch on those nodes yesterday. (See the xceiver-limit sketch at the end of this entry.)
- MT: deleted /data/se/store as it was fully duplicated to /hadoop/store, and cleared 2.5TB in /data
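- Follow-up on the xceiver error above: the usual mitigation in this Hadoop generation is to raise the datanode xceiver limit in hdfs-site.xml on the datanodes and restart them. Sketch only; 4096 is a commonly used value, not a tested recommendation, and note the property name really is spelled "xcievers":
      <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
      </property>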
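- For reference, a minimal sketch of the kind of space check the new cron jobs perform (threshold, filesystems, and mail address are placeholders; JT's actual scripts may differ):
      #!/bin/sh
      # warn when /home or /var passes 90% use; run daily from cron
      df -P /home /var | awk 'NR>1 && $5+0 > 90 {print $6, $5}' | \
        while read fs pct; do
          echo "$fs is at $pct" | mail -s "disk space warning: $fs" admin@example.edu
        done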
February 22, 2012
JT & MT
- Replaced DIMM_A3 in SE-0-1 with new DIMM. System appears to be working fine.
- R510-0-17 (was R510-0-5) was kickstarted on Monday so Rocks picks up the new MAC addresses from the new motherboard. Two "problem" disks (/dev/sdb, /dev/sdf) are still reporting smartd errors, BUT show no errors in hadoop (running 24 hours just fine) or Dell Storage Manager. Running Dell storage diagnostics to find out why... (Both drives failed the diagnostics - Feb. 23.)
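- For reference, the SMART cross-checks one can run on the two suspect drives (smartctl is from smartmontools; if the disks sit behind the PERC controller, a "-d megaraid,N" device option may be needed):
      smartctl -H /dev/sdb        # overall health verdict
      smartctl -A /dev/sdb        # attributes: watch Reallocated_Sector_Ct and Current_Pending_Sector
      smartctl -t long /dev/sdb   # start a long self-test; read results later with: smartctl -l selftest /dev/sdb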
February 19, 2012
JT (onsite) & MT (remote)
- SE-0-1 went down at 10:22pm on Feb. 18. The grid node went red in Ganglia due to various grid jobs and processes not completing. Jeff had to reboot it by hand; SE-0-1 was reporting a "CPU1" error on the front of the node. Once it was accessible online, saw that the following errors were reported:
- Severity : Non-Recoverable
  Date and Time : Sat Feb 18 22:50:35 2012
  Description : CPU 1 machine check detected.
- Severity : Critical
  Date and Time : Sat Feb 18 22:50:36 2012
  Description : Multi-bit memory errors detected on a memory device at location DIMM_A3
- Severity : Non-Recoverable
  Description : unrecoverable ECC error DIMM_A3
- SE-0-1 rebooted; hadoop apparently kept working even while the node was still reporting a critical DIMM_A3 error. Took a ~20 minute downtime at 3:20pm Feb. 19 to run system diagnostics on SE-0-1: the diagnostics flagged the DIMM_A3 error above, but the memory diagnostics completed with everything checking "ok", and omreport shows no new errors and Memory is "OK". Will contact Dell service tomorrow about this. Note this is the second outage of this sort.
- SE-0-1 apparently healthy now (4pm Feb. 19); grid jobs, etc. working. May have sporadic problems and unannounced downtimes as we work to resolve this. Attempts to bring up the Secondary NN as primary failed.
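- On the failed Secondary NN promotion: for next time, the documented route in this Hadoop generation is to start a namenode from the secondary's checkpoint with -importCheckpoint. Sketch only, not validated on our setup:
      # on the machine that will act as namenode:
      #  1. set fs.checkpoint.dir to a copy of the secondary's checkpoint directory
      #  2. ensure dfs.name.dir does not already contain a valid image, then:
      bin/hadoop namenode -importCheckpoint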
February 17, 2012
JT & Dell Tech Rep.; MT (remote)
- R510-0-5 offline in preparation for motherboard replacement. Motherboard replaced, system checks out fine on site. Probably need to kickstart it to fix Rocks connection to new MB (not visible currently on network), however kickstart file missing port bonding and other recent updates. MT to do homework on that. Also check/validate network connections next time in cluster room for debugging.
- Remember when we bring it back up to check system health (incl. disks - two may still be bad), then chkconfig condor on and chkconfig hadoop on (once the node is verified to be fine). If the disks are still bad, leave condor and hadoop off until they are removed, replaced, or repaired.
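- A short bring-up checklist sketch (the controller number is an assumption):
      omreport system alertlog              # any new hardware alerts after the MB swap
      omreport storage pdisk controller=0   # state of the physical disks, incl. the two suspect ones
      chkconfig --list condor               # confirm boot-time state before switching them on
      chkconfig --list hadoop
      chkconfig condor on                   # only once the health checks pass
      chkconfig hadoop on                   # only after the disks are verified good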
February 16, 2012
MT
- R510-0-1 /hadoop5 disk showed 4 unreadable sectors via smartd. Ran fsck, no change; erased and created a new partition, rebooted. Sector error still occurring... Outstanding issue.
- R510-0-8 had lost contact with Hadoop on Feb. 8, hadoop was not set to start upon reboot (chkconfig --list), fixed that, rebooted, apparently fine now.
- R510-0-5 disk work: 2 failed disks, voltage problem reported as critical. Removed from condor and hadoop, debugged with Dell phone support. It has a bad motherboard; Dell is sending a replacement.
February 13, 2012
MT
- On HN and GN, the /var/log filled up quickly overnight with messages about gridftp transfers. This took down condor. RSV and SAM tests began failing. Solution: remove (after checking) excessive logs, restart condor on HN. Checked that cron job is still functional. Stop and restart condor on remainder of nodes. While system was coming back up, implemented channel bonding on remainder of R510 nodes (requiring reboot of those nodes).
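- For next time, a minimal sketch of the triage used here (the gridftp log filename is a placeholder):
      du -sk /var/log/* | sort -n | tail   # find what is eating /var/log
      > /var/log/LARGE_GRIDFTP_LOG         # placeholder name: truncate (rather than delete) a log still held open
      service condor restart               # on the HN first, then on the other nodes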
February 8, 2012
MT & JT
- Had a cluster outage. Implemented TCP tuning on the R510 nodes; tested, and those nodes seem to keep the fuse mount of /hadoop. Installed extra ethernet cables for all R510 nodes and tested channel bonding on R510-0-9 and R510-0-5. Found a bug in how Rocks 5.4 implements bonding; had to modify iptables by hand while logged into the node directly, and reboot the node. (A bonding config sketch follows the unresolved list below.)
- Unresolved:
- R510-0-5 shows critical Voltage error (was going to replace /hadoop6 and /hadoop2)
- R510-0-1 shows 3 unreadable sectors on /hadoop5. fsck didn't fix any of the above (it did for R510-0-5 for a time)
- Hadoop log4j properties to keep /scratch from filling up
- Random gridftp failures still associated with a random java failure; java updates didn't fix it. Maybe a known hadoop 0.20 bug, or more TCP improvement needed on the GN.
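- For reference, a sketch of the channel bonding configuration style on the R510s (standard CentOS ifcfg files; the bonding mode, address, and netmask shown are illustrative assumptions, and on our Rocks 5.4 setup the hand-edited iptables fix noted above was also needed):
      # /etc/sysconfig/network-scripts/ifcfg-bond0
      DEVICE=bond0
      IPADDR=10.1.255.NNN          # placeholder address
      NETMASK=255.255.0.0          # placeholder netmask
      BOOTPROTO=none
      ONBOOT=yes
      BONDING_OPTS="mode=balance-alb miimon=100"

      # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
      DEVICE=eth0
      MASTER=bond0
      SLAVE=yes
      BOOTPROTO=none
      ONBOOT=yes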
January 31, 2012
MT & NH
- SE-0-1 had a DIMM_A3 critical error and shut down, which brought down hadoop for all nodes. Tried rebooting the machine; it gave a CPU1 error. Ran Dell Memory Diagnostics, which apparently repaired DIMM_A3. If another error occurs, consider reseating DIMM_A3 (note it was reseated in July 2011), then removing/replacing it. Note that the system tried to email/text the sysadmins but has no route to the outside: perhaps set up a script for that?
January 29, 2012
MT
- R510-0-5 directory /hadoop6 had problems again (had problems in Sep, 2011). Reformatted disk. It appears to be working...
- Problems still to solve:
- R510 nodes occasionally lose fuse contact with /hadoop, although the namenode still sees them. This means that condor jobs trying to access files in /store will fail. This happens ~once a month and the nodes require a reboot. Suspect a java/fuse incompatibility from the logs, but not sure why these nodes fail and the compute nodes do not. Suspect this is a similar problem to the randomly failing gridftp-simple RSV tests, and the less frequent randomly failing SAM MC tests.
- Compute nodes fill up on /scratch with hadoop logs. Experimented with hadoop log rotation settings, which do not seem to take effect. Have cleaned by hand, but need to solve this.
- Note newest R510 nodes not picked up as hadoop disks. Need to find setting and implement this before next cluster reboot.
- Compute node /hadoop1 disks are at 100% (a few MB remaining), which could cause problems; recommend hadoop node balancing (a long process) in the future. Consult the Hadoop Operations Guide for more instructions; a balancer sketch follows below.
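- Balancer sketch for the above (stock Hadoop balancer; the threshold is the allowed per-node deviation from average utilization in percent, and 10 is just an example - the run can be stopped at any time without harm):
      hadoop balancer -threshold 10    # run as the hadoop user from a node with the client config
      # or: bin/start-balancer.sh -threshold 10   and   bin/stop-balancer.sh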
January 2-9, 2012
MT & JT
- A user had a problem with kinit to CERN. We had changed the krb5.conf settings to the new CERN server, but this single user's account was still not recognized from our cluster (kinit for this account worked fine from other locations, and the problem persisted when doing kinit as this user from other user logins, i.e., not an environment-variable issue). Worked with CERN, no resolution. In the end the user changed their CERN password, and kinit to CERN then worked on all systems including ours.
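- For reference, the shape of the krb5.conf realm entry and the test involved (the KDC hostname here is our best guess at the new CERN server; take the authoritative values from CERN's documentation):
      # /etc/krb5.conf excerpt (illustrative)
      [realms]
        CERN.CH = {
          kdc = cerndc.cern.ch
          default_domain = cern.ch
        }

      # test from a cluster login ("username" is a placeholder):
      kinit username@CERN.CH
      klist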
January 8, 2012
MT
- Compute nodes fill up /scratch with hadoop logging when there are a lot of transfers (a large PhEDEx transfer, or multiple R510 nodes being down, can cause this due to balancing). Experimented with hadoop log rotation settings, but they do not seem to take effect even though they appear to be read; cleaning by hand for now.
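- On the rotation settings: the knob being edited is conf/log4j.properties on each node. A sketch of the size-capped appender variant (values are illustrative); one thing worth checking is that hadoop-daemon.sh exports HADOOP_ROOT_LOGGER (typically INFO,DRFA), which overrides the root logger set in log4j.properties and could explain why edits appear to be read but ignored:
      log4j.appender.RFA=org.apache.log4j.RollingFileAppender
      log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
      log4j.appender.RFA.MaxFileSize=256MB
      log4j.appender.RFA.MaxBackupIndex=10
      log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
      log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n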