February 2009 log

Instrumented cluster with Dell OpenManage. Updated Squid, PhEDEx, & VDT. Handled problems caused by PhEDEx dominating the SE services.

February 27, 2009

MK -- Kicked BeStMan, reconfigured PhEDEx

PhEDEx is hammering BeStMan: it monopolizes the HN services and eventually crashes BeStMan. srmcp commands invariably time out even while BeStMan is still running, because PhEDEx is using all the bandwidth. Created the /root/KickBeStMan.sh script and added it to cron:
2 1 * * 0,2,5 /root/KickBeStMan.sh
This at least handles BeStMan crashing from PhEDEx load and other factors (the cron entry runs the script at 01:02 on Sundays, Tuesdays, and Fridays), but the PhEDEx monopoly still needs to be addressed. I also reduced the frequency of logrotate, because PhEDEx may have exponential backoff routines that don't have time to stabilize in just one day of running; it now rotates only once a month. Also emailed the PhEDEx hypernews, and they suggested I use the parameters -batch-files and -jobs in my download agent configuration. Tried 10/3 for an hour, but RSV and SAM still failed, so tried 5/2. After an email from Paul and another hour of running with tests still failing, tried 2/2. We are now passing RSV and SAM tests.
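The contents of /root/KickBeStMan.sh are not recorded in this log. As a rough illustration only, a restart ("kick") helper of this kind might look like the sketch below. The init script path and the pause length are assumptions, not the values actually used here.

```shell
#!/bin/bash
# Hypothetical sketch only: the real /root/KickBeStMan.sh is not reproduced in this log.

# Assumed init script location; adjust to the local BeStMan install.
BESTMAN_INIT="${BESTMAN_INIT:-/etc/init.d/bestman}"
# Assumed pause (seconds) between stop and start so ports and locks are released.
KICK_PAUSE="${KICK_PAUSE:-10}"

# Stop and then start the service controlled by the command given in $1.
kick_service() {
    local init_script="$1"
    # A crashed service may fail to stop cleanly; ignore that and carry on.
    "$init_script" stop || true
    sleep "$KICK_PAUSE"
    # Return the start result so the caller can detect a failed restart.
    "$init_script" start
}
```

A cron-driven script would end with a call such as kick_service "$BESTMAN_INIT". Taking the control command as an argument keeps the function easy to try against a stand-in before pointing it at the real service.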

MK -- Dealt with RDC temperature problems, kicked BeStMan, looked at FileDownloadVerify using cksum instead of srmls

RDC experienced loss of cooling. Received an early-morning phone call and decided to shut down the cluster. Had received an "all OK" email from RDC at nearly the same time as the phone call, but had to assume the call meant the situation had worsened. So opted not to rely on the nodes' auto-shutdown ability while commuting in to turn off the UPS and disk array. Temperature never exceeded 80 deg. F, so the shutdown ended up not being necessary. Two lessons for handling RDC failures came out of this experience:
- Need to ask caller if the situation is worsening.
- Need to get ability to remotely shut down UPS, so that I can turn off both UPS and disk array remotely.
PhEDEx downloads stopped and seem to have a lot of errors, so kicked BeStMan.
Was not able to write a working script to replace srmls with cksum (not even sure that the current checksum script works); however, discovered that FileDownloadSRMVerify is generally not being called with the checksum option. The list option is what's used by default to get the file size, and that already uses my system's ls command, not srmls. So it's doubtful that FileDownloadSRMVerify is how PhEDEx is straining the HN. The strain is probably due purely to srmcp calls.
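For reference, a purely local check of a downloaded file (no SRM calls at all) can be sketched as below. This is not the PhEDEx FileDownloadSRMVerify script. The function names are made up for illustration, and whether a cksum CRC is the checksum type PhEDEx expects is not established in this log.

```shell
#!/bin/bash
# Sketch of a local post-transfer check; NOT the actual PhEDEx verify script.

# Succeeds when the file at $1 exists and is exactly $2 bytes long.
verify_size() {
    local path="$1" expected_bytes="$2" actual_bytes
    [ -f "$path" ] || return 1
    # wc pads its output on some systems, so strip whitespace before comparing.
    actual_bytes=$(wc -c < "$path" | tr -d '[:space:]')
    [ "$actual_bytes" = "$expected_bytes" ]
}

# Succeeds when the POSIX cksum CRC of the file at $1 equals $2.
verify_cksum() {
    local path="$1" expected_crc="$2" actual_crc
    [ -f "$path" ] || return 1
    # cksum prints "CRC SIZE"; keep only the CRC field.
    actual_crc=$(cksum < "$path" | awk '{print $1}')
    [ "$actual_crc" = "$expected_crc" ]
}
```

Both functions only read the local filesystem, so neither adds load to BeStMan. The cksum variant reads the whole file and is therefore much slower than the size check on large files.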

MK -- Installed and configured OpenManage on all WNs

Used shell scripts and cluster-fork to install. Still need to add this to the Rocks post-install kickstart file.

MK -- Tested WN OpenManage install

Managed to install on the phedex node. Will have to figure out all the alertactions to set (omconfig system alertaction -? lists all available alerts). Will use the web interface for the HN OpenManage to do a first round, then map the settings to the appropriate omconfig commands so I can script them for all the WNs. Still need to install on all other WNs and add the install to the Rocks kickstart.
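One possible shape for that scripting step is a small table of settings turned into omconfig calls, printed first as a dry run. The event names (tempwarn, tempfail, fanfail), the execappath parameter, and /root/notify.sh below are assumptions for illustration. They should be checked against the output of omconfig system alertaction -? before anything is run for real.

```shell
#!/bin/bash
# Sketch: turn a list of event/action pairs into omconfig commands.
# Event names and the execappath parameter are ASSUMPTIONS; verify them with
# `omconfig system alertaction -?` on the node. /root/notify.sh is hypothetical.

# Settings table: one "event action-parameter" pair per line.
ALERT_SETTINGS='tempwarn execappath=/root/notify.sh
tempfail execappath=/root/notify.sh
fanfail execappath=/root/notify.sh'

# Print (DRY_RUN=1, the default) or run one omconfig call per table line in $1.
apply_alert_settings() {
    printf '%s\n' "$1" | while read -r event action; do
        # Skip blank lines in the table.
        [ -n "$event" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "omconfig system alertaction event=$event $action"
        else
            omconfig system alertaction "event=$event" "$action"
        fi
    done
}
```

Running apply_alert_settings "$ALERT_SETTINGS" prints the commands for review. Setting DRY_RUN=0 would execute them on the local node.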

MK -- Installed Dell OpenManage on HN

Configured HN to shut itself off in the event of temp warnings. Not yet configured to email warnings for other problems.

MK -- Installed CMSSW_2_2_4, updated Squid client

Only 1GB of space remains for CMSSW releases.
Installed release 4.0rc6 from SquidForCMS tarball. Previous modification to Squid cache size somehow stopped Squid from reporting. Upgrade to 4.0rc6 somehow fixed the problem. Perhaps the last restart of the Squid service didn't go right.
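Since free space for CMSSW releases is getting tight, a quick pre-install check could look like the sketch below. The function names are made up for illustration, and the path and threshold are left to the caller.

```shell
#!/bin/bash
# Sketch: check free space on the filesystem holding a given path.

# Print the free space, in whole GB, of the filesystem holding path $1.
free_gb() {
    # -P forces one line per filesystem; column 4 is available 1K blocks.
    df -Pk "$1" | awk 'NR==2 {print int($4 / 1024 / 1024)}'
}

# Succeed only if at least $2 GB are free on the filesystem holding $1.
has_room() {
    [ "$(free_gb "$1")" -ge "$2" ]
}
```

For example, has_room /path/to/releases 5 || echo "not enough space" would flag a shortfall before a release install is started. Both the path and the 5 GB figure there are placeholders.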

MT -- Updated to VDT 1.10.1t

MK -- Updated PhEDEx to 3.1.2

Actually performed as an update instead of a fresh install. Copied 3.1.1 to 3.1.2, pointed the current link at 3.1.2, and performed the update there. All appears fine, but I can point back to 3.1.1 if transfers fail later. Updated tarball and Rocks Kickstart info to do 3.1.2.
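The copy-and-repoint pattern described above can be sketched roughly as follows. The directory layout (version directories side by side, with a symlink named current next to them) is an assumption for illustration, and the PhEDEx update steps themselves are not shown.

```shell
#!/bin/bash
# Sketch of a copy-then-repoint upgrade. The layout is assumed, not taken from the log.

# Copy the old version tree to a new one and repoint the "current" symlink.
# $1 = base directory, $2 = old version, $3 = new version
stage_version() {
    local base="$1" old="$2" new="$3"
    [ -d "$base/$old" ] || return 1
    # Refuse to overwrite an existing tree.
    [ ! -e "$base/$new" ] || return 1
    cp -a "$base/$old" "$base/$new"
    # Relative link; -n replaces the link itself instead of following it.
    ln -sfn "$new" "$base/current"
}

# Roll back (or forward) by pointing "current" at an existing version.
# $1 = base directory, $2 = version
point_current_at() {
    local base="$1" version="$2"
    [ -d "$base/$version" ] || return 1
    ln -sfn "$version" "$base/current"
}
```

With this layout, stage_version "$base" 3.1.1 3.1.2 prepares the new tree for the in-place update, and point_current_at "$base" 3.1.1 is the rollback if transfers fail later.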

MK -- Reduced Squid cache size to 5GB, installed CMSSW_2_0_12, now passing SAM tests

Just not enough room for CMSSW releases on /scratch, so reduced Squid cache size to 5GB.
Installed CMSSW_2_0_12, the release SAM uses, in order to pass the SAM tests.
Subscribed to /QCD_pt_0_15/SAM_IDEAL_V9_SAM/GEN-SIM-RAW-RECO and downloaded srm://cmssrm.fnal.gov:8443/srm/managerv2?SFN=/11/store/user/test/oneEvt.root to srm://hepcms-0.umd.edu:8443/srm/v2?SFN=/store/user/test/oneEvt.root. Also set the correct permissions on all the /store/mc/unmerged subdirectories, which had been made some time ago with incorrect permissions, before I fixed BeStMan to run under sudo. Now passing all SAM tests for the first time!

MK -- Updated CMSSW CVS configuration using apt

Contacted the sw-develtools hypernews and was told the appropriate CMSSW CVS updates could be retrieved by calling apt-get --reinstall install cms+cms-common+1.0-cms2 and apt-get install cms+cms-cvs-utils+1.0-cms. Ran both, and the configuration appears to be updated.
Installed CRAB_2_4_4
Discovered that the CRAB job failures were caused by my white_list arguments. CRAB no longer accepts all three of umd.edu, UMD.EDU, and T3_US_UMD; now only SE T3_US_UMD with CE UMD.EDU works.

MK -- Manually updated CMSSW CVS configuration

The cms-cvs-utils package has an older configuration based on CERN's Kerberos IV endpoint. CERN recently updated to Kerberos V. Edited /software/cmssw/cmsset_default.sh, /software/cmssw/cmsset_default.csh, /software/cmssw/slc4_ia32_gcc345/cms/cms-cvs-utils/1.0/bin/projch.sh, and /software/cmssw/slc4_ia32_gcc345/cms/cms-cvs-utils/1.0/bin/projch.csh and replaced all instances of the phrase :kserver: with :gserver: .
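The same replacement could be scripted instead of edited by hand. The sketch below is one way to do it, not a record of what was actually run. The function name is made up, and it keeps a .bak copy of every file it touches.

```shell
#!/bin/bash
# Sketch: switch CVS access methods from :kserver: to :gserver: in place.

# Rewrite every :kserver: in the given files, keeping a .bak copy of each.
switch_to_gserver() {
    local f
    for f in "$@"; do
        # Report and skip anything that is not a regular file.
        [ -f "$f" ] || { echo "skipping missing file: $f" >&2; continue; }
        sed -i.bak 's/:kserver:/:gserver:/g' "$f"
    done
}
```

It would be called with the four files listed above as arguments. The .bak copies give a quick way back if a later cms-cvs-utils update conflicts with the manual change.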