Recovery
Description | Power down, power up, and reinstall sequences.
Notes | This guide is written for our current cluster configuration. HN or GN reinstall is strongly discouraged, but notes are provided in case it becomes necessary. Nearly all of this page is out of date as of 2015 and will be removed and replaced shortly. ADMINS of hepcms: please consult our private Google pages for current documentation.
Last modified | September 10, 2015 |
- Power down and up procedures
- Recover from HN reboot
- Recover from GN reboot
- Recover from WN, IN, or SE reboot
- Recover from WN or IN reinstall
- Notes about reinstalling the HN, GN, & SE
Power down and up procedures
Before powering down, make sure you have a recent backup copy of the critical files. Our backup script places critical files in /data/users/root on a weekly basis.
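A quick sanity check of the backup's freshness (a minimal sketch; as root, assuming the script writes directly into /data/users/root):
ls -lt /data/users/root | head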
Place the SE in safemode following instructions on the HadoopOperations twiki (su - on SE-0-2):
hadoop dfsadmin -safemode enter
Wait 1 minute, then copy both the critical backup files in /data/users/root and the NameNode index in /scratch/hadoop/scratch/dfs/name/ on SE-0-2 to an external disk.
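A minimal sketch of that copy, assuming the external disk is mounted at /mnt/backup (a placeholder mount point, not a documented one):
rsync -a /data/users/root/ /mnt/backup/hepcms/critical-files/
rsync -a se-0-2:/scratch/hadoop/scratch/dfs/name/ /mnt/backup/hepcms/dfs-name/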
To power down the entire cluster, login to the HN as root (su -):
ssh-agent $SHELL
ssh-add
rocks run host compute R510 SE interactive grid "poweroff"
poweroff
Note that the network switch, KVM switch, disk array, and UPS cannot be powered off remotely. Note that compute-0-9 won't shut down, as it doesn't talk to the network; the guide below assumes it is still down. Also note that our cluster uses 'R510 SE compute' where a simpler cluster would just use 'compute'.
If you are concerned about the possibility of power spikes during shutdown, go to the RDC to power off the remainder of the system (/data physical disk cannot be powered down remotely):
- Flip both power switches on the back of the big disk array.
- Turn the bottom UPS off by pressing the O (circle) button. Put the top UPS on standby by holding the power button until the "output" reads 0v.
- Flip the power switch on the back of the bottom UPS.
- Remove the floor tile directly behind the cluster. If possible without undue strain to the connectors, unplug all four power cables from their sockets. These are locking plugs, so gently rotate the plug to release, then pull. Leave the cable plugs near the sockets as some are at a different amperage and mixing cables may lead to confusion later. Replace the floor tile.
To power up, go to the RDC:
- Remove the floor tile directly behind the cluster, plug in power cables in the floor (rotate to lock into place), and replace the floor tile.
- Flip the power switch on the back of the bottom UPS back on.
- Turn the bottom UPS on by pressing the | / Test button on the front. Turn the top UPS on by pressing the power button until the output no longer reads 0v. After a long shutdown, you may get a red light on the battery and beeping; this should go away.
- Turn the big disk array on by flipping both switches in the back. Flip one switch, wait for the disks and fans to spin up, then spin down. Then flip the second switch.
- Press the power button on the HN, which is the PowerEdge 2950 directly above the disk array (the lower of the two 2950 nodes). Wait for it to boot completely. You may get many Dell OpenManage reports emailed due to the powerup. Ensure the health of the system later with the omreport commands.
- Power cycle the network switch using its power cable (the switch has no power switch, hardy har har). You may need a stepstool to reach the cable from the front of the rack.
- Log in on the HN as root.
- Open an internet browser and enter the address 10.1.255.254. If you don't get a response, wait a few more minutes for the switch to complete its startup, diagnostics, and configuration.
- Log into the switch (user name and password can be obtained from Marguerite Tonjes).
- Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
- Press the power button on the GN, the upper of the two 2950 nodes. Wait for it to boot completely.
- Press the power button on the SE node, the R410 above the GN.
- Press the power buttons on all 13 WNs and both INs. Wait a few seconds between each one.
- Follow the procedures below to recover from HN reboot, GN reboot, and WN & IN reboot.
Recover from HN reboot
First check that Ganglia is reporting for at least the HN (if Ganglia is not running, on the HN as root, /sbin/service gmond restart). If the HN was rebooted independently of the other nodes (e.g., the power down and up sequence was not followed), best practice is to reboot all the other nodes as well. If the other nodes have already been rebooted in the proper sequence (as per a manual powerup), skip to checking OMSA. Our cluster uses compute interactive SE R510 to include all the appliance types; a simpler cluster may only use compute interactive.
- Check if other nodes rebooted
- In case SE reboot is needed
- Check OMSA
- Check Condor
- Check Backup disk mount
Check if other nodes rebooted
On the HN as root (su -), and only in a full shutdown recovery where the GN is not up and /data is not mounted properly, reboot the GN:
ssh-agent $SHELL
ssh-add
rocks run host grid "reboot"
Once the GN has completely rebooted, if that was needed (check Ganglia to see if it's reporting), check to see when the remaining nodes rebooted:
rocks run host compute R510 interactive SE grid "who -b" collate=yes
Follow the procedures below to recover from GN reboot and WN & IN reboot. Monitor Ganglia to make sure that all 19 nodes come up and report to the HN after reboot. If they fail to do so, a complete reboot of the HN and other nodes may be in order. If this doesn't work, your best course of action is to email Marguerite Tonjes. Note that occasionally a node will not report on Ganglia but is accessible via the network; check that Ganglia is running with service gmond status and start the service if need be.
Please note that if one or more nodes didn't shut down properly (power outage), the following command can be used to check hardware for nodes that came back online but may have had another problem appear upon reboot:
omreport chassis
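For example, to run it from the HN against a single node that came back up (after ssh-agent $SHELL; ssh-add as above; compute-0-1 is just an example node):
rocks run host compute-0-1 "omreport chassis"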
In case SE reboot is needed (not typical)
If the SE is already up and you need to reboot it, you must place the SE in safemode following instructions on the HadoopOperations twiki (su - on SE-0-2). Alternatively, if few datanodes have come up, you can put hadoop into safemode to protect the file index from changes:
hadoop dfsadmin -safemode enter
Wait 1 minute, then copy the critical NameNode index files on SE-0-2 in /scratch/hadoop/scratch/dfs/name to an external disk as backup. Then you may reboot the SE from the HN (after ssh-agent $SHELL; ssh-add, as above):
rocks run host SE "reboot"
Once the SE is back up, be sure to leave hadoop safemode, following instructions from the HadoopOperations twiki (su - on SE-0-2):
hadoop dfsadmin -safemode leave
Check OMSA
OMSA should start automatically during boot, but since it is our emergency monitoring system, we always check that it came up successfully. Issue command line calls to check:
omreport system summary
omreport chassis temps
omreport system thrmshutdown
If OMSA isn't reporting correctly, restart it:
srvadmin-services.sh stop
srvadmin-services.sh start
and try again.
Check Condor
As any user on the HN:
condor_status
This should show 7 compute nodes reporting with 8 batch slots each, and 7 R510 nodes with 24 slots each, for 224 total. (March 2013: with R510-0-9 down, report 200 slots). Due to Condor reporting intervals, all machines may not report for ~15 minutes. Additionally, due to Condor slot release requirements, slots can take another ~15 minutes to report as available (from status Owner to Claimed or Unclaimed). If condor_status doesn't show a desired state within 30 minutes, there may be serious problems having little to do with Condor itself. You can try to restart condor service (service condor restart) on individual nodes, and as a last resort, you can attempt to reboot the HN and the rest of the nodes one more time to see if condor_status gives desired results. If this doesn't work, your best course of action is to email Marguerite Tonjes.
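If you only want the summary counts rather than the per-slot listing, condor_status can print just the totals:
condor_status -total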
Check Backup disk mount
Check the backup disk mount: do a "df -h" on the head node. On this node (and only this node), the disk /CampusBackup should be mounted with a total size of 500 GB. If it is not mounted, (as root) mount /CampusBackup. Note that if this is not done, the daily cron script that backs up to this remote disk will fill up the local / disk instead. If there have been campus-wide troubles, this disk may not mount because the remote address is unreachable; be sure to fix it before the nightly cron script runs.
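A minimal check-and-remount sketch using only the commands above (as root on the HN):
df -h /CampusBackup        # should show ~500 GB total if mounted
mount /CampusBackup        # only if the previous command shows it is not mounted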
Recover from GN reboot
First check that OMSA started and is monitoring the GN correctly using commands in the HN OMSA section. The GN also has an OMSA web interface if a GUI is preferred.
Check that the GN is reporting to the Ganglia monitor. If it isn't after a few minutes, there may be a serious problem requiring another GN reboot (looking at the screen at the cluster may help). If this doesn't work, your best course of action is to email Marguerite Tonjes. Note that occasionally a node will not report on Ganglia but is accessible via the network; check that Ganglia is running with service gmond status and start the service if need be.
Usually if Condor is reporting correctly on the HN, the GN is also correctly linked into the pool. However, as any user on the GN, call condor_status. If this fails to report correctly within 30 minutes, try to restart the condor service (service condor restart), then if needed, attempt a GN reboot; if condor_status continues to fail on the GN but not the HN, email Marguerite Tonjes.
You need to run the OSG and PhEDEx commands below, in order, after EVERY reboot; they will not necessarily restart correctly on their own.
- Check /data mount on the other nodes
- Check OSG status
- Start OSG services which did not come on automatically
- Test OSG
- Start PhEDEx
Check /data mount on the other nodes
Since our large disk array is a direct NFS mount (not an auto-fs mount), nodes which haven't been rebooted since the GN rebooted might need to mount /data from the GN again. This is especially true for the HN, which always boots before the GN boots. Check to make sure that the mount was successful on all the nodes, and remount if necessary. As root on the HN (su -):
ssh-agent $SHELL
ssh-add
rocks run host compute R510 SE interactive grid "ls /data" collate=yes
If a /data remount is needed on the HN, use the following (note that the command above doesn't check the /data mount on the HN); to remount on the other nodes, use the rocks run commands that follow:
umount /data
mount /data
ssh-agent $SHELL
ssh-add
rocks run host compute R510 SE interactive "umount /data"
rocks run host compute R510 SE interactive "mount /data"
Check OSG Status
OSG services are supposed to start during boot time, but with OSG3, they sometimes do not come up or do so in an unstable state. In our last cluster reboot, only condor-cron and rsv needed to be started after the GN reboot. If needed, follow the Stop OSG procedure and the steps that follow; otherwise skip to the Test OSG and Start PhEDEx steps below. As root (su -) on the GN, you can check the status of all OSG services to avoid stopping/starting all of them (we have glexec currently disabled; that's OK):
/sbin/service rsv status
/sbin/service condor-cron status
/sbin/service httpd status
/sbin/service bestman2 status
/sbin/service globus-gridftp-server status
/sbin/service gratia-probes-cron status
/sbin/service tomcat5 status
/sbin/service globus-gatekeeper status
/sbin/service gums-client-cron status
/sbin/service condor status
/sbin/service fetch-crl3-cron status
/sbin/service fetch-crl3-boot status
Start OSG services which did not come on automatically
OSG services are supposed to start during boot time, but two of them do not. As root (su -) on the GN, restart these and any other services which were not running in the above status check:
/sbin/service bestman2 restart
/sbin/service httpd restart
/sbin/service rsv restart
Test OSG
We also force run RSV probes right away instead of waiting for the usual RSV cron cycle to complete (which can take minutes to hours):
su - rsv
rsv-control --run --all-enabled
Updated RSV probes usually show on the RSV probe page within ~20 minutes. Make sure all probes are reporting OK with a recent timestamp. If you cannot get the RSV probes to work, follow the full procedure to Stop OSG, Restart GUMS, and Start OSG from the Errors page.
- Some RSV probes, especially cacert-expiry and crl-expiry, often report a warning and can be safely ignored, especially if OSG services have been down for an extended period.
- The ping-host, jobmanager-default-status, jobmanagers-available, and gram-authentication probes are especially important. If they report a warning or error, there is usually a serious problem requiring investigation, usually starting with the Errors page, and ending with a GN reboot.
- gridftp-simple, srmping, and srmcp-readwrite are also important probes, but occasionally will report an error within ~15 minutes of first starting OSG services. If they report an error, force run the RSV probes again to see if the error goes away. If it doesn't, there is usually a serious problem requiring investigation, starting with restarting OSG services, possibly even a reboot of the GN.
OSG services are also tested by SAM, which has additional tests specific to the CMS environment. SAM also has a long cycle and can take several hours to update. March 2013: failing SAM MC test (still debugging). Warning on three other tests for software we do not have installed.
You can (optionally) also test manually by submitting a CRAB job to UMD. To test our ability to service most CRAB features, find a dataset hosted at UMD and set the following in your crab.cfg:
datasetpath = /something/hosted/at/UMD
return_data = 0
copy_data = 1
storage_element = T3_US_UMD
user_remote_dir = any/subdirectory
se_white_list = T3_US_UMD
ce_white_list = T3_US_UMD
This will test our ability to submit CRAB jobs, service CRAB jobs, and accept output sent directly to our SE. SE output should show up in /hadoop/store/user/cern_username/any/subdirectory.
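Once the job reports successful stageout, you can confirm the output arrived by listing the target directory (cern_username and any/subdirectory are the same placeholders used in the crab.cfg above):
ls /hadoop/store/user/cern_username/any/subdirectory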
Start PhEDEx
PhEDEx does not automatically start during boot and each of the three PhEDEx services must be manually started. We have aliased commands in the ~phedex/.bashrc file, corresponding to the full commands here. Because the PhEDEx environment can get quite long, it is safest to execute each set of these commands in an entirely fresh shell. As phedex (su - phedex) on the GN:
prodEnv
prodService start
debugEnv
debugService start
devEnv
devService start
The PhEDEx website link monitor is the simplest way to see if PhEDEx started and reconnected properly. If a few links are down (red) but most are up (green), the problem is rarely on our end and can be safely ignored. If all links are down, the PhEDEx services may not have started properly. They can be stopped by changing "start" in the above commands to "stop", then started again. Since we have little PhEDEx activity at any given time, PhEDEx failing to start is a low priority problem, but should be reported to Marguerite Tonjes.
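For example, to bounce just the Prod instance using the same aliases (a sketch; run it in a fresh shell as phedex, as above):
prodEnv
prodService stop
prodService start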
Recover from WN or IN reboot
Check that the nodes are reporting to the Ganglia monitor. If they don't after a few minutes (5 for 8 CPU machines, up to 10 for 24+ CPU machines), there may be a serious problem requiring another reboot. Note that occasionally a node will not report on Ganglia but is accessible via the network; check that Ganglia is running with service gmond status and start the service if need be. If this doesn't work, your best course of action is to email Marguerite Tonjes.
Most of the following commands are executed from the HN as root in order to federate calls to all the WNs and INs into a single call from the HN. Of course any command can be executed on individual nodes if desired.
- Check OMSA
- Check Condor
- Check /data mount
- Check /hadoop mount
- Check /hadoop health
- Check /cvmfs mount
Check OMSA
OMSA should start automatically during boot, but since it is our emergency monitoring system, we always check that it came up successfully. Issue command line calls to check as root (su -) from the HN:
ssh-agent $SHELL
ssh-add
rocks run host compute R510 interactive grid SE "omreport chassis temps" collate=yes | grep Reading
rocks run host compute R510 interactive grid SE "omreport system thrmshutdown" collate=yes | grep Severity
Be sure to check that all nodes respond correctly. If OMSA isn't reporting correctly, restart it:
rocks run host compute R510 interactive grid SE "srvadmin-services.sh stop" collate=yes
rocks run host compute R510 interactive grid SE "srvadmin-services.sh start" collate=yes
and try again.
If a single node rebooted and you are trying to make sure there is no hardware problem, the following command will list "ok" if everything is fine in hardware. You will still have to check logs for disk problems.
rocks run host compute-0-1 "omreport chassis"
Check Condor
Newly rebooted WNs can take ~15 minutes to 'report for duty' to the condor pool. As root on the HN, check with condor_status, as discussed previously. If they fail to report and show as available within 30 minutes, try restarting condor (service condor restart) on the affected nodes; as a last resort, a reboot of the HN and/or WNs may be necessary.
Usually if Condor is reporting correctly on the HN, the rest of the nodes are also correctly linked into the pool. However, we check to be sure. As root (su -) from the HN:
ssh-agent $SHELL
ssh-add
rocks run host compute R510 interactive grid SE "condor_status" collate=yes | grep Total
Be sure to check that all nodes respond correctly (see the number of CPUs above in the HN checks). If this fails to report correctly within 30 minutes, try to restart condor (service condor restart), and as a last resort attempt another reboot of the WNs or INs in question; if condor_status continues to fail on the WNs or INs but not the HN, email Marguerite Tonjes.
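To restart condor on a single affected node directly from the HN (compute-0-1 is just an example node; substitute the node that is failing to report):
rocks run host compute-0-1 "service condor restart"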
Check /data mount
Sometimes nodes fail to mount /data when they boot, cause unknown. Check that it's been mounted successfully as root (su -) on the HN:
ssh-agent $SHELL
ssh-add
rocks run host compute R510 interactive grid SE "ls /data" collate=yes
Be sure to check that all nodes list the directory contents correctly. If this fails, attempt to unmount, then mount /data again:
rocks run host compute R510 interactive grid SE "umount /data" collate=yes
rocks run host compute R510 interactive grid SE "mount /data" collate=yes
then check again (on rare occasions, three separate rounds of umount /data, or umount -nf /data, followed by mount /data have worked). If this continues to fail, a reboot of the GN with subsequent reboot of the WNs and INs may be required. Failing this, email Marguerite Tonjes.
Check /hadoop mount
If the SE was rebooted, be sure to leave hadoop safemode once you are sure that all 14 nodes are reporting properly to hadoop (give it 15 minutes), following instructions from the HadoopOperations twiki (su - on SE-0-2):
hadoop dfsadmin -safemode leave
Check that /hadoop has been properly mounted on the nodes (there is no /hadoop mount on the HN). Hadoop requires the SE to be up. Check the mounts as root (su -) from the HN:
ssh-agent $SHELL
ssh-add
rocks run host compute R510 interactive grid SE "ls /hadoop" collate=yes
Be sure to check that all nodes list the directory contents correctly. Hadoop can be started with commands documented on the General page. With continued failure, it may also require an SE hadoop restart (service hadoop restart), and as a last resort an SE reboot (see the note about SE reboot and safemode above) followed by a reboot of the INs. Failing this, email Marguerite Tonjes.
Check /hadoop health
Check that all compute and R510 nodes are online. Check that hadoop sees all 14 live datanodes properly using firefox on an interactive login (either as yourself on an interactive node or on the HN root GUI). March 2013: 13 live datanodes; the IP address for compute-0-9 will show as "Dead" and R510-0-8 won't be listed. This link can only be browsed from within the cluster, and needs to be manually refreshed to see changes.
firefox &
http://se-0-2:50070/dfshealth.jsp
You can check if the hadoop service is running on all datanodes from the HN (after ssh-agent $SHELL; ssh-add as above):
rocks run host compute R510 "/sbin/service hadoop status" collate=yes
If you see a message like "hadoop dead but pid file exists", then hadoop needs to be restarted on that particular node ("/sbin/service hadoop restart"). If the node shows hadoop running but the SE thinks it's "dead", then run on that one node: "/sbin/service hadoop restart"
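For example, to restart hadoop on a single datanode from the HN (after ssh-agent $SHELL; ssh-add as above; compute-0-1 is just an example node):
rocks run host compute-0-1 "/sbin/service hadoop restart"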
If you are unable to see all 14 live datanodes in hadoop under firefox and they do exist according to Ganglia, continue debugging with suggestions on the Errors page. With continued failure, it may also require a restart of hadoop on the datanodes or the SE (service hadoop restart), taking care that the SE has working hadoop before restarting the service on the WNs. The last resort would be an SE reboot (see the note about SE reboot and safemode above), followed by a reboot of the WNs. Failing this, email Marguerite Tonjes. Note that compute-0-9 is on the Dead Node list by IP address, and R510-0-9 needs re-kickstart. Note that occasionally a node will not report on Ganglia but is accessible via the network; check that Ganglia is running with service gmond status and start the service if need be.
Check /cvmfs mount
On the compute, R510, interactive and grid nodes, CVMFS should be mounted; check it from the HN (after ssh-agent $SHELL; ssh-add):
rocks run host interactive grid compute R510 "ls /cvmfs/cms.cern.ch/cmsset_default.csh" collate=yes
All nodes should report the file except for R510-0-9, which needs kickstart repair. If there are problems with cvmfs, check that Squid is running on the HN and follow the debugging links for CVMFS on the Errors page. Failing this, email Marguerite Tonjes.
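A quick way to check Squid on the HN is via its init script; the service name here is an assumption (depending on how Frontier Squid was installed, it may be squid or frontier-squid):
/sbin/service squid status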
Now you are done with powering up/recovering from cluster reboot!
Recover from WN or IN reinstall
First configure OMSA as per the directions below, then perform all the checks from recovering from WN or IN reboot. Also be sure that the appliance XML files are set up not to overwrite the Hadoop volumes during reinstall. Note that the commands below do not include our special "R510" compute nodes.
On the HN as root (su -), configure OMSA on the WNs and INs:
ssh-agent $SHELL
ssh-add
rocks run host compute interactive "/share/apps/OMSA/OMSAconfigure.sh"
Additionally, the INs have an external network port on eth1 which must be manually configured. Rocks will configure eth1 almost completely, but has a bug and doesn't set the external network gateway. Use system-config-network on each IN to set it manually, following our network configuration (see the sketch below). Note that one of the INs is a secondary hadoop NN, so be sure to set up one of them as the SNN after installation (or modify the kickstart for that node alone). Additionally, the local hostname of that node is in the Hadoop configuration on all nodes using hadoop, so if that changes, be sure to propagate that configuration file.
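If you prefer the command line to system-config-network, the fix usually amounts to adding the missing GATEWAY line to the eth1 config and restarting networking. This is only a sketch; the gateway address below is a placeholder, not our real one:
echo "GATEWAY=192.168.1.254" >> /etc/sysconfig/network-scripts/ifcfg-eth1
/sbin/service network restart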
Notes about reinstalling the HN, GN, & SE
The HN, GN, & SE are NOT intended for reinstall. Regardless, some notes are provided in case it becomes critically necessary. We have backup scripts which copy critical files needed for recovery to /data/users/root. In most cases, do not overwrite the /scratch directory when given the option.
Reinstalling the HN
Reinstalling the HN requires a reinstall of all other nodes since it is the Rocks frontend and manages the network and software on all other nodes. To reinstall the HN, follow all or most of the material in the General guide. Note that user accounts and passwords can be preserved by backing up /etc/shadow, /etc/passwd, and /etc/group. After completing the Rocks installation, use the AddAccounts.py script:
python AddAccounts.py --userfile=passwd --groupfile=group
where passwd and group are the backed up files. To recover original passwords, the encrypted lines in the backed up shadow file need to be manually merged into the current /etc/shadow file.
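A minimal sketch of that merge for a single account (jsmith is a hypothetical user and the backup path is a placeholder):
grep '^jsmith:' /data/users/root/shadow
vipw -s
The grep shows the old encrypted line; paste it over the freshly generated jsmith line when vipw -s opens /etc/shadow for editing.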
Additionally, data in /home can be preserved through cautious and careful choices during the Rocks installation. Prior to reinstall, write down the partition on which /export is installed (use df -h; currently /export is on /dev/sdb1). During Rocks installation, choose manual partitioning, select the appropriate partition, and set the mount point as /export. Make sure the option to format is not selected.
Reinstalling the GN
Before Kickstarting:
- The data resident in /data can be preserved via careful partition commands in the grid.xml Kickstart file (on the HN). Specifically, this line should be present in the <pre> section of /export/rocks/install/site-profiles/5.4/nodes/grid.xml on the HN:
ignoredisk --drives=sdc
- The Kickstart file does not include directives to reinstall software. However, both PhEDEx and CMSSW are entirely self-contained in their respective installation directories and can be preserved through subsequent installs via partition directives in the <pre> section of grid.xml:
part /localsoft --onpart=sda6 --noformat
part /scratch --onpart=sdb1 --noformat
After Kickstarting:
- Rocks will configure eth1 (the external network port) almost completely, but has a bug and doesn't set the external network gateway. Use system-config-network to set it manually (following our network configuration).
- The grid appliance inherits from the compute appliance, which network mounts /data from the GN in /etc/fstab. However, since the GN is the node from which /data is network mounted, the /etc/fstab file on the GN needs to be manually edited after Kickstart. Specifically:
- Edit /etc/fstab and remove the line:
grid-0-3:/data /data nfs rw 0 0
then add the line:
/dev/mapper/data-lvol0 /data xfs defaults 1 2
- Mount it on the GN:
mount /data
- Make it available to other nodes. Edit /etc/exports and add the line:
/data 10.0.0.0/255.0.0.0(rw,async)
the /scratch directory is also auto-network mounted on other nodes as /sharesoft, so add the line:
/scratch 10.0.0.0/255.0.0.0(rw,async)
then start the nfs service on the GN:
/etc/init.d/nfs restart
then make sure the nfs service will always start when the GN reboots:
/sbin/chkconfig --add nfs
chkconfig nfs on
- Mount /data on the other nodes as per the GN reboot guide.
- TCP tuning needs to be done again for PhEDEx transfers. Follow the PhEDEx instructions to get a proxy. PhEDEx services can be started as per the GN reboot guide, if necessary.
- CMSSW doesn't have any services which must be started, though Squid for Frontier should already be running on the HN.
- OMSA base packages are installed via Kickstart on all nodes, but we choose to install (and configure) all the OMSA packages on the GN following the Hardware guide.
- To the best of our knowledge, OSG cannot be recovered even if the data in the installation directory is preserved because there are a few files (notably services) which exist in directories that are formatted during Kickstart (like /etc/init.d). Follow the OSG instructions to reinstall OSG from scratch.
Reinstalling the SE
Before Kickstarting:
- The important information on the SE is the set of files in /scratch/hadoop/scratch/dfs/name, which Hadoop automatically copies periodically to the Secondary NameNode (/scratch/hadoop/) and to /share/apps/hadoop/checkpoint
- The configuration files in /etc/hadoop/conf/ are also essential and should be copied during Kickstart; be sure Rocks has them up to date (a copy sketch follows this list).
- Hadoop is set up with the internal network name of SE-0-2. As Rocks has a tendency to pick up other MAC addresses during kickstart, the network name and IP of SE-0-2 may change during kickstart. You would then need to change the name in the Hadoop configuration files on all the datanodes (compute and R510), the interactive nodes, the GN, and the Rocks kickstart area.
- Place the SE in safemode following instructions on the HadoopOperations twiki (su - on SE-0-2):
hadoop dfsadmin -safemode enter
Wait 1 minute, then copy the critical NameNode index files on SE-0-2 in /scratch/hadoop/scratch/dfs/name to an external disk as backup.
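A minimal sketch of the configuration copy referenced in the list above (the destination directory name is only illustrative):
cp -a /etc/hadoop/conf /data/users/root/hadoop-conf-backup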
After Kickstarting:
- Refer to the Hadoop Admin guide for setup and configuration if it did not work out of the box.