August 2008 log
Got the cluster fully operational and able to provide all desired services. Made all of them available from the interactive WNs, so users only have to log in to the HN to change their password. Successfully tested CRAB job submission to the site. Dealt with a power loss at the RDC.
August 29, 2008
MK--Installed ImageMagick
- ggv appears to be very slow for some users. ps2pdf followed by gpdf is fast, but adds a layer of complication. ImageMagick provides display, although display tends to rotate the image.
August 27, 2008
MK--Restarted condor service on nodes & edited replace-condor-client.xml
- Realized that changes to SUSPEND functionality had not gone into effect b/c the WN condor clients had not been restarted. cluster-fork'ed a restart once existing jobs completed. Marguerite then submitted enough jobs to fill the slots, and a subsequent user login did not result in job eviction. Jobs suspended in the middle of last night were possibly interrupted by cron jobs, since the interruptions came at regular intervals, exactly every two hours @ minute 44. It is also possible these are 'pings' from OSG monitoring. They are not from RSV, which runs in the development queue.
- Also realized that I had not implemented changes to SUSPEND condor functionality in the rocks distribution. Edited replace-condor-client.xml, the Admin guide, and rebuilt the distribution. Will test with the next WN reinstall.
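Note: the "every two hours @ minute 44" pattern above can be checked against a list of suspension times. A minimal Python sketch, assuming the times have already been extracted from the condor logs by hand; the function name and any sample times are hypothetical and not taken from the actual logs.

```python
from datetime import datetime


def fits_two_hour_schedule(timestamps, minute=44):
    """Return True if every timestamp falls on the given minute of the hour
    and consecutive timestamps are a whole multiple of two hours apart."""
    # Ignore seconds so that 02:44:07 and 04:44:31 still count as on schedule.
    times = sorted(t.replace(second=0, microsecond=0) for t in timestamps)
    if any(t.minute != minute for t in times):
        return False
    for earlier, later in zip(times, times[1:]):
        gap_hours = (later - earlier).total_seconds() / 3600
        if gap_hours % 2 != 0:
            return False
    return True
```

If the function returns False for the real suspension times, a fixed two-hour cron schedule becomes a less likely explanation, and something irregular such as external monitoring is worth a closer look.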
August 26, 2008
MK--Installed LaTeX and ggv (.ps viewer) on WNs
- Verified that latex and dvipdf work. dvips attempts to print to a default printer, which is naturally not installed on the cluster, so dvips fails.
- ggv works fine on the WNs. Installed all packages using cluster-fork and added to new rocks distribution.
August 21, 2008
MK--Removed condor suspend functionality, recovered compute-0-0
- Will need to keep an eye on impact to interactive users.
- compute-0-0 went down yesterday with error printing on LED display: "E1410 CPU Machine Chk E1410 CPU 2 IERR." Was able to power down and boot once successfully, but BIOS halted on reboot and subsequent attempts to power down and up. Contacted Dell support & they had me reseat the CPUs. Upon reseating, error code went away and node came up and rebooted with no issues.
August 18, 2008
MK--Backup
- Backed up files, created script. Script needs to auto-copy to hep-t3 webserver, then needs to be added to cron.
August 15, 2008
MK--Brought cluster back up from RDC power down
- Initial boot wouldn't mount the big disk because I didn't give the big disk array enough time to power up before turning the HN on. Fixed, wrote up.
- Wrote power down and power up procedure.
- RSV once again was running in the wrong production queue. Fixed, wrote up.
- CRL RSV probe once again failing due to cron job not running recently. Fixed, wrote up.
- Took the opportunity to shoot the WNs (except for the PhEDEx node) to get latest changes.
- Added note to user guide about CRAB versions and OSG 1.0 issues.
- Added note to grid policy about 2GB memory limit.
August 14, 2008
MK--Successful CRAB submission to the cluster
- Needed $OSG_APP/cmssoft/cms to point to /software/cmssw instead of $OSG_APP/cms. Fixed and now condor_g jobs run just fine.
- glite appears to be OK, though not as thoroughly tested because the wait time on the RB is too long.
- Jobs which contact Frontier appear to run OK, but SLOWLY.
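Note: the log does not record how $OSG_APP/cmssoft/cms was repointed. A minimal Python sketch, assuming it is a symbolic link that gets replaced; the function name is hypothetical, and the demo runs in a temporary directory rather than the real $OSG_APP.

```python
import os
import tempfile


def repoint_link(link_path, new_target):
    """Make link_path a symlink to new_target, replacing an existing symlink.

    Refuses to touch link_path if it exists but is a real file or directory.
    """
    if os.path.islink(link_path):
        os.remove(link_path)
    elif os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    os.symlink(new_target, link_path)
    return os.readlink(link_path)


if __name__ == "__main__":
    # Stand-in for $OSG_APP; nothing outside this temporary directory changes.
    fake_osg_app = tempfile.mkdtemp()
    link = os.path.join(fake_osg_app, "cmssoft", "cms")
    repoint_link(link, os.path.join(fake_osg_app, "cms"))  # old, wrong target
    print(repoint_link(link, "/software/cmssw"))           # new target
```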
August 13, 2008
MK--Installed g77, debugged CRAB
- ALPGEN needs a FORTRAN compiler, so installed g77 on all nodes (via cluster-fork for now rather than kickstart; also added to kickstart, so it will be distributed on the next cluster shoot).
- WNs needed lcg-CA package installed, got successful condor_g submission to the site, though CMSSW.stderr claims it can't find voms-proxy-info.
- glite submission to cluster resolved by using white_list=UMD.EDU; the WMS is case-sensitive. Jobs fail with "permission denied" on the osg_tmp directory.
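Note: the white_list fix above comes down to an exact string comparison. A minimal Python sketch of why capitalization matters in such a comparison; this is an illustration only, not the WMS's actual matching code, and the function name is hypothetical.

```python
def site_in_white_list(white_list, site_name):
    """Return True if site_name appears in the comma-separated white_list.

    The comparison is exact, so "umd.edu" does not match "UMD.EDU".
    """
    entries = [entry.strip() for entry in white_list.split(",")]
    return site_name in entries


if __name__ == "__main__":
    print(site_in_white_list("UMD.EDU", "UMD.EDU"))
    print(site_in_white_list("umd.edu", "UMD.EDU"))
```

A case-insensitive variant would lowercase both sides before comparing, but the behavior noted in this entry suggests no such normalization is applied.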
August 12, 2008
MK--Installed CMSSW_2_1_0, added SITECONF directory to CMSSW installation, removed CRAB appliance, tested CRAB submission to site
- CMSSW_2_1_0 didn't damage the existing 1_6_12 install. Unclear if CMSSW_2_1_0 is functional, as existing config throws errors (even at cmslpc).
- SITECONF directory doesn't cause CMSSW to choke as it did before, probably because site-local-config.xml is configured correctly this time!
- Removed CRAB appliance from Rocks DB and renamed crab-node-0-0 back to compute-0-6. Asked Jules to extend the hepcms.umd.edu alias to include hepcms-7.umd.edu, done!
- CRAB_2_3_1 doesn't support condor_g scheduler and datasetpath=none. Eric suggested glite as scheduler, but this didn't work. Installed CRAB_2_3_0 on compute-0-4, used condor_g as scheduler, and jobs submitted OK; using glite as scheduler failed, as did glitecoll.
August 11, 2008
MK--Installed Kerberos-enabled CVS on all WNs, installed Squid proxy for Frontier
- WNs run an older version of Kerberos than the HN, so they choked on Kerberos authentication. Used yum localinstall (and RPMforge) in extend-compute.xml <post> to install.
- CMSSW contacts Frontier for conditions. Frontier requires a squid proxy. Installed it.
August 6, 2008
MK--Installed CRAB on all WNs, wrote user guide for CRAB
- Ported CRAB code to extend-compute.xml instead of special CRAB appliance, now available on all WNs (except phedex node).
- Need to check that SE stageout to hepcms will work using webservice_path.
- Edited user guide for CRAB.
August 5, 2008
MK--Installed CRAB on crab node
- All checked out OK doing manual install (submitted and retrieved Pythia job successfully). Copied commands to crab-node.xml and installed via shoot-node.
- Need to move to all WNs and change the hepcms.umd.edu alias to include hepcms-7.