October 2008 log
Commissioned PhEDEx links to UCSD & Purdue, practiced power down and up procedure with Marguerite, added security (check /root/security.txt for details) & upgraded to OSG 1.0.
October 31, 2008
MK -- Installed RSV configure-osg.py fix, removed CMSSW_2_1_2 & 2_1_3, installed 2_1_11 & 2_1_12
- RSV asked me to download a tar with a few modified files containing fixes for the gridftp_hosts bug I reported. I did so, and it worked fine. However, I forgot to copy the old OSG release with the -p flag (which preserves permissions), so couldn't roll back to the 'pure 1.0.0' release. Left with minor changes from the RSV tar. Changes were made to the files in /root/config/src, which were copied into /share/apps/osg/monitoring. Hopefully this won't cause any problems when it's time to perform updates - don't anticipate it will. Kept the 'pure 1.0.0' release with incorrect permissions in /share/apps/osg-1.0.0-backup-incorrectPermissions-20081031, just in case conflicts do occur. Can get the correct file permissions from the files in /share/apps/osg-1.0.0-backup-slightlyModifiedRSV-20081031/monitoring and the list of modified files from /root/config/src. Will be difficult to merge, so will only do it when/if it becomes necessary.
- Removed CMSSW_2_1_2, 2_1_3, installed 2_1_11 & 2_1_12. ~7GB of space remaining.
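For next time, a minimal sketch of a permission-preserving backup. It assumes the backup was meant to be taken with cp -p. The scratch directory and file names are stand-ins made up for illustration, not the real OSG paths.

```shell
# Hypothetical demo in a scratch directory; the real source would be the
# OSG install and the destination a dated backup directory.
work=$(mktemp -d)
mkdir -p "$work/osg/monitoring"
echo "probe" > "$work/osg/monitoring/probe.sh"
chmod 775 "$work/osg/monitoring/probe.sh"

# -R recurses into directories; -p preserves mode, ownership, and timestamps.
cp -Rp "$work/osg" "$work/osg-backup"

# Compare the permissions of the original and the backup (GNU stat).
orig_mode=$(stat -c '%a' "$work/osg/monitoring/probe.sh")
backup_mode=$(stat -c '%a' "$work/osg-backup/monitoring/probe.sh")
echo "original=$orig_mode backup=$backup_mode"
```

Without -p, the copy gets its mode filtered through the current umask and its ownership set to whoever ran cp, which is presumably how the 'incorrectPermissions' backup came about.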
October 29, 2008
MK -- Worked on failing RSV srm tests, tested OSG with CRAB, all OK!
- Failing both srm-ping-probe and srmcp-srm-probe. Turns out that BeStMan-Gateway is not yet available from VDT. Tried the test-cache version instead, but hit a few problems. Used the existing BeStMan install via tarball and will try again once VDT releases officially. Now passing srm-ping-probe, but not srmcp-srm-probe. srmcp works from the command line and PhEDEx believes all links are up. The RSV srmcp error log shows that RSV isn't specifying the rsvuser grid proxy correctly. Sent email to Arvind.
- CRAB submission of job reading DBS data worked, OSG now appears OK.
October 28, 2008
MK -- OSG 1.0 running, not tested
- Was able to start all services successfully, failing some RSV tests.
October 27, 2008
MK -- Began OSG 1.0 install
- Lots of confusion about which commands need to be executed before configuration is called and which need to be executed after.
October 21, 2008
MK -- Updated VDT-Version and LFC-Client as per GOC ticket #5709
- Super-stupid thing for me to do. Turns out the update was intended for OSG 1.0 sites only. Site now failing almost every RSV test. Tried rolling back to older backup of OSG directory, but no dice. globus-ws and osg-rsv services won't start.
October 20, 2008
MK -- Updated VO package, installed CMSSW_2_1_10
- As per GOC ticket #5768. Didn't really appear to do anything, or it was lightning fast.
- No problems with 2_1_10 install, though running out of space for CMSSW releases.
October 18, 2008
MK -- Performed HN reboot
- Problem with Wordpress "not able to select the wordpress database" and the MySQL web interface not showing any databases was fixed by the HN reboot. Updated the Admin How-To "errors" section accordingly.
October 17, 2008
MK -- Continued security configuration
- Check /root/security.txt for details.
October 9, 2008
MK -- Wordpress not working
- All other monitoring utils recovered, but Wordpress is not. "We were able to connect to the database server (which means your username and password is okay) but not able to select the wordpress database." Attempted restart of mysql, mysqld, apache, httpd, tomcat-5. Wordpress issue may be due to problems with new users, user DB may be corrupted.
October 8, 2008
MK & MT -- Monitoring utils have stopped running - RSV & Ganglia. Added new users.
- Stopped some services. Restarted ganglia and others. Restarted OSG-RSV. Restarted BeStMan for good measure.
- Added new users. Issue with being unable to change password on first login. Later resolved - either user error or a call to rocks sync users did the trick.
October 7, 2008
MK -- Security
- Continued security modifications. Check /root/security.txt for details.
October 6, 2008
MK -- Continued security modifications, updated Admin How-To guide
- Details on latest security modifications can be found in /root/security.txt.
- Updated the admin how-to guide based on feedback from Marguerite from the power down and up practice, as well as WN reinstall.
October 5, 2008
MK, MT, & NH -- Practiced cluster power down and power up, performed WN reinstall
- Trained Marguerite and Nick on procedures to power down and power up the cluster. All was well - even RSV went to the correct condor queue. cacert RSV probes don't seem to have run at all recently though. /data seems to be network mounting on the WNs just fine after reboot. I must have fixed this at some point, but can't find it anywhere in my logs or guides - I don't know how I did it, but I did!
- Took the cluster downtime as an opportunity to reinstall the WNs (shoot-node). Had a few minor issues, simple logic flaws in extend-compute.xml, easily resolved. However, /etc/krb5.conf is STILL not overwriting the old krb5.conf, or something is moving it to krb5.conf.1. Placed Kerberos stuff after gLite install (possibly unnecessary), then manually moved /etc/krb5.conf to /etc/krb5.conf.old so that wget will really try to name it /etc/krb5.conf. Haven't tried again, will have to wait until the next shoot-node to see if it worked. Otherwise, all is well.
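A possible explanation for krb5.conf.1: when the target file already exists, wget by default saves the download under a numbered name (file.1, file.2, ...) rather than overwriting it. A minimal sketch of a workaround follows, assuming the file is fetched with wget. The function name fetch_config and the URL are hypothetical and not taken from extend-compute.xml.

```shell
# fetch_config URL DEST: download URL and overwrite DEST in place.
# wget -O writes to the named file even if it already exists, so no
# numbered copy such as DEST.1 is created.
fetch_config() {
    url=$1
    dest=$2
    # Keep the previous copy so the change can be reverted.
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.old"
    fi
    # Note: -O truncates DEST before downloading, so a failed download
    # leaves an empty file; restore from DEST.old in that case.
    wget -q -O "$dest" "$url"
}

# Example call (hypothetical URL):
# fetch_config http://example.org/krb5.conf /etc/krb5.conf
```

This does the same job as manually moving the old file aside, but keeps the backup and the overwrite in one step so it can run unattended during a reinstall.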
October 3, 2008
MK -- Replaced RSV probe
- cacert-crl-expiry-probe in our 'hacked' RSV V2 can't handle the new vdt-update-certs package (released Sep 11 and upgraded on our cluster on Sep 30). Downloaded new probe on Arvind's instructions and replaced existing probe in $VDT_LOCATION/osg-rsv/bin/probes. Last timestamp before update 11:48.
October 2, 2008
MK -- BeStMan needed a kick
- Need to add auto-kicks to cron, once a week I should think.
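A rough sketch of what the weekly auto-kick could look like as an /etc/cron.d style entry. The restart script path is a placeholder, since the actual BeStMan restart command isn't recorded here, and the Sunday 03:30 schedule is an arbitrary choice. The snippet only writes the entry to a scratch file for review and does not install it.

```shell
# Write a candidate weekly cron entry to a scratch file for review.
# /path/to/restart-bestman.sh is a placeholder, not a real script.
cron_file=$(mktemp)
cat > "$cron_file" <<'EOF'
# min hour day-of-month month day-of-week user command
30 3 * * 0 root /path/to/restart-bestman.sh
EOF
cat "$cron_file"
```

Files under /etc/cron.d take a user field before the command, unlike a per-user crontab. Day-of-week 0 is Sunday.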
October 1, 2008
MK -- Began commissioning PhEDEx links from T2_US_UCSD & T2_US_Purdue, improved ssh security
- Debug link created, injection dataset created, and transfer request approved. Will monitor progress over next week and send an email to HN if Prod link should be created.
- Check point (3) in /root/security.txt for details of the ssh security changes.