April-May 2012 Log
Major outage in May for the upgrade to OSG 3.0, CMS grid software changes, and other updates. One disk in the RAID-5 /data array was replaced and Hadoop was upgraded. In progress ...
May 24, 2012
MT
- Got Gratia properly configured and running; all SAM/SUM tests pass and the cluster has started accepting outside jobs. Installed VOMS software on IN nodes. Note: I get an occasional (but not consistent) failure on SUM test #19: "/usr/etc/globus-user-env.sh unreadable or not found". Need to check the OSG install on all WNs. Still do not have a working srm-copy. According to the metrics, the node is fully CMS grid capable. The full list of fixes necessary for the OSG 1.2 to OSG3 upgrade will go on a separate Upgrade web page.
- Expect that all previous Hadoop Java errors and gridftp-simple RSV probe errors are now fixed.
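A minimal sketch of a per-node check for the intermittent SUM test #19 failure noted above. The only path taken from this log is /usr/etc/globus-user-env.sh; the script, its function names, and the idea of running it on each WN are assumptions, not part of the OSG tooling. It should be run as the account the grid jobs map to, since a check run as root will report the file readable regardless of its permissions.

```python
#!/usr/bin/env python3
"""Report whether required files exist and are readable on this node."""
import os
import sys


def unreadable(paths):
    """Return the subset of `paths` that is missing or not readable."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.R_OK))]


def main(paths):
    """Print each problem path; return 0 if all are readable, 1 otherwise."""
    bad = unreadable(paths)
    for p in bad:
        print("unreadable or not found:", p)
    return 1 if bad else 0


# Only act when paths are given on the command line, e.g.
#   check_readable.py /usr/etc/globus-user-env.sh
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

A non-zero exit status on any WN would point at that node's OSG install.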
May 22, 2012
MT
- Had to fix a number of configuration files. Got the RSV srm probes working; lcg-cp and globus-url-copy both work, but srm-copy commands do not. Fixed a file to point to the proper PhEDEx LoadTest07 area (I do not understand how it was working with this error).
May 19, 2012
MT
- Fixed GUMS certs installation, worked to debug RSV srm probes.
May 18, 2012
MT, JT
- Installed the OSG3 upgrade. This also required upgrading Hadoop to the OSG3 version of 0.20. Installed certs on interactive nodes. Worked to remove parts of OSG 1.2 that are no longer needed (some areas may still be needed).
- Had trouble bringing /hadoop back online after the last reboot; was able to debug it and the system appears healthy. The problems were related to the change of the hadoop username with the upgrade.
- SUM tests for CE still failing, two srm RSV tests failing, grid services not completely working.
May 17, 2012
MT, JT, CF, JG
- Started training new sysadmins with a scheduled shutdown. Discovered during the shutdown that one of the disks in the RAID-5 /data system had a blinking orange light and had died. Following the Dell documentation, swapped in a replacement 750GB disk and the system brought it back into the array. /data appears to be fine. Also had a couple of nodes whose power cords needed tightening.
- Ran software updates. Updated CRAB to 2_8_1 and gLiteUI to 3.2.11 (both on interactive nodes; affects users' outgoing CRAB jobs). Modified the R510 and compute nodes in Hadoop to use dfs.datanode.max.xcievers = 4096, per the OSG All Hands Meeting recommendation, to get rid of Java errors in Hadoop. Finalized the Hadoop 0.20 install (this had not been done). Moved the corrupt Hadoop file to lost+found to make a clean system (already removed the corrupt SAM test files and emailed the user about the user-owned corrupt file).
- Tried implementing log4j.appender.DRFA.MaxBackupIndex=30, but that is not possible; apparently the only option is a cron job.
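Since MaxBackupIndex is not available here, the cron job mentioned above could be sketched roughly as follows. This is an illustration, not the script actually deployed: it assumes rotated files are named `<base_name>.<yyyy-MM-dd>` (so that sorting by name sorts by date), and the function name, log directory, and base file name are hypothetical.

```python
#!/usr/bin/env python3
"""Prune date-suffixed rotated log files, keeping only the newest N."""
import os
import sys


def prune_rotated_logs(log_dir, base_name, keep=30):
    """Delete all but the `keep` newest rotated copies of `base_name`.

    Rotated files are assumed to be named `<base_name>.<yyyy-MM-dd>`, so a
    plain name sort is also a date sort. The active log file is never touched.
    Returns the sorted list of removed file names.
    """
    prefix = base_name + "."
    rotated = sorted(
        name for name in os.listdir(log_dir)
        if name.startswith(prefix) and os.path.isfile(os.path.join(log_dir, name))
    )
    # Everything except the last `keep` entries is old enough to remove.
    doomed = rotated[:-keep] if keep > 0 else rotated
    for name in doomed:
        os.remove(os.path.join(log_dir, name))
    return doomed


# Usage: prune_logs.py LOG_DIR BASE_NAME [KEEP]
if __name__ == "__main__" and len(sys.argv) >= 3:
    keep = int(sys.argv[3]) if len(sys.argv) > 3 else 30
    for removed in prune_rotated_logs(sys.argv[1], sys.argv[2], keep):
        print("removed", removed)
```

A daily crontab entry would then call the script with the real log directory and base log file name for each Hadoop daemon, with 30 as the last argument to match the intended MaxBackupIndex value.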