August 2009 Log
Reconfigured cluster to add grid and interactive nodes. Installed OSG 1.2. Now installing CMSSW via automatic jobs from Bockjoo. Installed PhEDEx 3.2.0 and switched to using srm-copy instead of srmcp.
August 29, 2009
MK -- Rebooted HN and all other nodes, reconfigured BMCs, reinstalled interactive-0-0, compute-0-2 & compute-0-7
- Rebooted the HN in order to turn off its BMC, which was clashing with the external IP address for interactive-0-0 (hepcms-in1.umd.edu).
- All other nodes had to be rebooted because their BMCs had incorrect DHCP entries and needed to get new ones from the rebooted HN.
- Several BMCs had bad configurations and required fixing.
- compute-0-2 & compute-0-7 had incorrect "rocks list host interface" settings, so I corrected them and reinstalled both nodes. Reinstalled interactive-0-0 in case it needed a proper reinstall to get a correctly configured external network. Probably not necessary, but harmless.
- hepcms-in1.umd.edu now responding to ssh, so informed users to ssh to hepcms.umd.edu, which round robins to the two interactive nodes.
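A quick way to confirm that the round-robin name hands out both interactive nodes is to list the addresses it resolves to. This is only a minimal sketch: the helper name round_robin_addresses is made up for illustration, and the stand-in resolver with its placeholder addresses is hypothetical so the example runs without touching DNS.

```python
import socket


def round_robin_addresses(name, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses that name resolves to."""
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list).
    _canonical, _aliases, addresses = resolver(name)
    return sorted(set(addresses))


def fake_resolver(name):
    # Hypothetical stand-in for DNS; the addresses are documentation-range
    # placeholders, not the real addresses of the interactive nodes.
    return (name, [], ["192.0.2.11", "192.0.2.10"])


print(round_robin_addresses("hepcms.umd.edu", resolver=fake_resolver))
```

Dropping the resolver argument makes the same call go to real DNS; two addresses in the result would suggest both interactive nodes are in the rotation.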
August 27, 2009
MK -- Mapped SAM DN to SAM account
- Check grid-mapfile-local for details.
August 25, 2009
MK -- Turned off interactive-0-0 BMC
- Jules @ OIT confirmed that there were two devices trying to communicate on the IP address assigned to interactive-0-0. I turned off the BMC on interactive-0-0 in case that was the problem, though its configuration seemed OK and ssh to interactive-0-0 still did not work.
- Jules tracked it down to something on the HN trying to take that IP address. Perhaps I told the BMC on the HN to talk to the outside world! So I need to reboot the HN to turn its BMC off. Was not able to contact its BMC via ipmish.
August 24, 2009
MK -- Installed CMSSW_2_2_12 & 2_2_13, mapped Bockjoo to cmssoft account
- Also gave cmssoft ownership of grid3-locations.txt file, made OSG_APP world writable and all subdirs world writable. Asked Bockjoo to begin installing CMSSW.
August 20, 2009
MK -- Fixed MyOSG GIP GLUE error
- Turned out I had forgotten to change my SE port number back to 8443 in my config.ini file. Stopped services, fixed, started services. Removed host cert from local grid mapfile, removed /data/se/osg/(cms, mis, ops) subdirectories.
August 18, 2009
MK -- Tried to debug MyOSG GIP GLUE error
- GIP is still giving the GLUE error which had been indicative of an account mapping problem previously. Added host cert (not just http cert) to mapfile as well. Suspect it's some other problem, like sourcing the OSG CE environment in the uscms01 ~/.cshrc or a sudo problem for daemon. After reading /sharesoft/osg/ce/gip/var/logs/gip.log, I thought the problem might be that I deleted the old directories named mis, cms and ops in /data/se/osg. So I made them again and gave their matching users ownership. No dice, but haven't deleted the directories just yet.
August 17, 2009
MK -- Started OMSA on all nodes, installed PhEDEx 3.2.0 using srm-copy
- Forgot that every time nodes are reinstalled, OMSA install has to be finalized as well as configured. Called appropriate scripts. Also need to configure various storage settings for the grid node (TODO).
- Installed PhEDEx 3.2.0 on grid node and configured using srm-copy instead of srmcp. Unsuspended LoadTest from FNAL, which they upped to 5 MB/sec while we're testing. Will need to check PhEDEx logs tomorrow to make sure srm-copy configuration is OK.
August 16, 2009
MK -- Installed SE in CE directory, changed grid-0-0 hostname, reinstalled OSG CE & SE
- srm-ping was failing no matter what port I used. Installed BeStMan in /sharesoft/osg/ce instead and srm-ping succeeded.
- srmcp and srm-copy were failing. event.srm.log gave the error:
ts=2009-08-17T20:09:21.465Z level=Console class=gov.lbl.srm.util.TSRMUtil [srminfo]Unrecognizable url:srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/tmp/1250539759-storage-probe-test-file-remote.29673 for this server:httpg://grid-0-0.local:8443/srm/v2/server tid=Thread-13
The hostname command on the grid node prints grid-0-0, unlike on the HN, so I followed these instructions to change the hostname to hepcms-0.umd.edu. After the hostname change, various CE services would not start, such as mysql5, condor-cron, and osg-rsv. Reinstalled OSG CE & SE in a completely fresh area.
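The mismatch in the event.srm.log line above (the URL names hepcms-0.umd.edu while the server identifies itself as grid-0-0.local) can be picked out mechanically. The sketch below assumes other occurrences of this message follow the same wording as the single line quoted in this entry; the regular expression and the helper name host_mismatch are illustrative and not part of BeStMan.

```python
import re
from urllib.parse import urlsplit

# Pattern assumed from the one log line quoted in this entry; other BeStMan
# messages may be worded differently.
UNRECOGNIZED = re.compile(
    r"Unrecognizable url:(?P<requested>\S+) for this server:(?P<server>\S+)"
)


def host_mismatch(log_line):
    """Return (requested_host, server_host) if they differ, else None."""
    match = UNRECOGNIZED.search(log_line)
    if match is None:
        return None
    requested = urlsplit(match.group("requested")).hostname
    server = urlsplit(match.group("server")).hostname
    if requested == server:
        return None
    return requested, server


line = (
    "ts=2009-08-17T20:09:21.465Z level=Console class=gov.lbl.srm.util.TSRMUtil "
    "[srminfo]Unrecognizable url:srm://hepcms-0.umd.edu:8443/srm/v2/server"
    "?SFN=/tmp/1250539759-storage-probe-test-file-remote.29673 "
    "for this server:httpg://grid-0-0.local:8443/srm/v2/server tid=Thread-13"
)
print(host_mismatch(line))  # ('hepcms-0.umd.edu', 'grid-0-0.local')
```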
August 15, 2009
MK -- Changed Rocks HN hostname & IP to hepcms-hn.umd.edu, gave grid node hostname hepcms-0.umd.edu
- Used rocks set host interface, system-config-network, rocks set var service, mysql, /etc/sysconfig/static-routes, /opt/condor/etc/condor_config.local to change the hostname & IP address of the Rocks HN. Details will be posted in the admin how-to guide.
- Gave the grid node the hostname hepcms-0.umd.edu and corresponding IP address.
- Updated configuration page.
August 12, 2009
MK -- Installed CMSSW releases on the grid node
- Installed all existing CMSSW releases, as well as CMSSW_3_2_4 on Sarah's request.
August 11, 2009
MK -- Powered new KVM & 2950 (grid node)
- Switching to different machines on the KVM simply requires pressing the PrintScreen button.
- No problems getting the grid node to run from the new appliance kickstart file. Began installing CMSSW on a new network mount /sharesoft.
August 10, 2009
MK -- Inserted new PDU, KVM, and 2950 into the rack.
- Not yet powered. Bumped the power cable on one of the WNs, but it came back up with no problems.
August 7, 2009
MK -- Solved GIP GLUE error, installed BeStMan from OSG 1.2
- GIP GLUE error is because OSG 1.2 now gathers some information for BDII by issuing various srm commands. It uses the site certificate to generate the proxy and our site cert is not in our grid mapfile. We followed the instructions here to add our site DN to the grid mapfile and the error went away.
- Installed BeStMan, but forgot to add new information to the sudoers file, so initial RSV tests failed. Added new information according to these instructions and passed. Specifically, srm-rm was failing. I need to ascertain if the old information in the sudoers file is still necessary. When upgrading to the new grid node, it's also not clear if the sudoers information needs to be on the Rocks node, the grid node, or both. I suspect the Rocks node.
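Before rerunning the GIP or RSV tests, it can help to confirm that a DN really is present in the grid mapfile. The sketch below assumes the usual mapfile layout of a double-quoted DN followed by a local account name; the DN and account in the sample are hypothetical placeholders, not our site's entries, and the helper name mapped_account is made up for illustration.

```python
import shlex


def mapped_account(mapfile_text, dn):
    """Return the account mapped to dn, or None if the DN is absent."""
    for raw in mapfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex.split honours the double quotes around the DN.
        fields = shlex.split(line)
        if len(fields) >= 2 and fields[0] == dn:
            return fields[1]
    return None


# Hypothetical entries for illustration only.
sample_mapfile = '''# local additions
"/DC=org/DC=example/OU=Services/CN=host.example.org" exampleuser
'''

print(mapped_account(sample_mapfile, "/DC=org/DC=example/OU=Services/CN=host.example.org"))
print(mapped_account(sample_mapfile, "/DC=org/DC=example/OU=People/CN=Nobody"))
```

A result of None for the site DN would match the situation described above, where the site cert was missing from the mapfile.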
August 6, 2009
MK -- Installed OSG CE 1.2, modified syntax in grid3-locations.txt file
- Initially failed SAM tests due to reporting in grid3-locations.txt file. Rob Snihur had contacted me earlier about a possible syntax problem with the line:
VO-cms-slc4_ia32_gcc345
I changed it to:
VO-cms-slc4_ia32_gcc345 slc4_ia32_gcc345 /software/cmssw
And we passed SAM tests again.
- Tests using RSV and CRAB OK. However, we had a MyOSG GIP validation test error:
GLUE Entity GlueSEAccessProtocolLocalID does not exist
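The grid3-locations.txt fix earlier in this entry amounted to turning a one-field line into a three-field line. A small check along those lines is sketched below; it assumes three whitespace-separated fields is what the tests expect (this log does not spell out the meaning of each field) and that lines starting with # are comments. The helper name incomplete_entries is hypothetical.

```python
def incomplete_entries(text, expected_fields=3):
    """Return (line_number, line) pairs with fewer than expected_fields fields."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed: blank lines and # comments are ignored
        if len(line.split()) < expected_fields:
            problems.append((number, line))
    return problems


# The two forms of the line quoted in this entry: before and after the fix.
sample = """VO-cms-slc4_ia32_gcc345
VO-cms-slc4_ia32_gcc345 slc4_ia32_gcc345 /software/cmssw
"""
print(incomplete_entries(sample))  # [(1, 'VO-cms-slc4_ia32_gcc345')]
```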
August 4-5, 2009
MK -- Created grid & interactive appliances
- Most work involved configuring the grid and interactive appliance kickstart xml so that the grid and interactive nodes would not run condor jobs but would still be able to submit them.
- Also modified the partition table on the grid appliance.
- Both have the meta packages "Workstation Common" and "Authoring and Publishing", which should include all the needed latex and emacs rpms.
August 3, 2009
MK -- Edited extend-compute.xml
- Removed some manual installs of individual packages such as screen or gcc and added meta packages which include these packages already. Should make for a more complete set of utilities on the worker nodes. Also removed some code that will only be executed on the interactive nodes, such as installing gLite-UI & CRAB.