September 2008 log

Primarily small upgrades and maintenance, as the cluster has reached full operational status. Installed a number of CMSSW_2_1_X releases.

September 30, 2008

MK & MT--Updated OSG CA-Certificates-Updater, trained Marguerite

Followed instructions here and called pacman -update CA-Certificates-Updater. There is a new RSV probes package that won't show us as failing due to cert issues, but we're running a 'hacked' RSV V2, so I chose not to call pacman -update OSG-RSV-Probes. Logging in as rsvuser showed the RSV jobs are running in the correct condor queue. RSV probes successful so far; will check again once they've run cert checks. voms-proxy-init also OK.

MK--Created LQ group, fixed ssh kerberos tickets

Added appropriate users to LQ group and created /data/groups directory. Created LQ subdirectory and gave appropriate permissions so any group member can modify.
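A minimal sketch of the directory setup described above. The log does not record the exact permissions used, so the mode 2775 (group-writable plus setgid, so new files inherit the group) is an assumption, and make_group_dir is a hypothetical helper name.

```python
import os
import shutil
import stat


def make_group_dir(path, group=None):
    """Create path as a group-writable directory and return its mode bits.

    The 0o2775 mode is an assumption; the log only says that any
    group member can modify the directory.
    """
    os.makedirs(path, exist_ok=True)
    if group is not None:
        # Changing the group needs root, or ownership plus membership in the group.
        shutil.chown(path, group=group)
    # rwxrwsr-x: group members can write; setgid keeps new files in the group.
    os.chmod(path, 0o2775)
    return stat.S_IMODE(os.stat(path).st_mode)


# Example (run as root on the head node):
#     make_group_dir("/data/groups/LQ", group="LQ")
```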
Despite being able to get a Kerberos ticket from FNAL, ssh wasn't picking up the ticket. Fixed the problem by editing /etc/ssh/ssh_config and adding the GSSAPI options.
For an unknown reason, WNs couldn't get FNAL tickets. Possible that the Rocks kickstart file is not correctly picking up the file from the HN during installation (TODO).
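A small sketch for checking which GSSAPI options an ssh_config file actually sets, which could be run against the HN copy and a WN copy to compare them. The log does not say which options were added; GSSAPIAuthentication and GSSAPIDelegateCredentials are standard OpenSSH client keywords and are only assumed here. gssapi_settings is a hypothetical helper, and it ignores Host block scoping for simplicity.

```python
def gssapi_settings(config_text):
    """Return GSSAPI options found in ssh_config text as {lowercased keyword: value}."""
    found = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # ssh_config accepts both "Keyword value" and "Keyword=value".
        parts = line.replace("=", " ", 1).split(None, 1)
        if len(parts) == 2 and parts[0].lower().startswith("gssapi"):
            # ssh uses the first value it obtains for each keyword.
            found.setdefault(parts[0].lower(), parts[1].strip())
    return found


# Assumed example of what the edited file might contain.
sample = """
Host *
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
"""
print(gssapi_settings(sample))
```

An empty result for a WN's file would suggest the kickstart did not install the edited ssh_config there.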

MK & MT--Installed CMSSW_2_1_8, kicked BeStMan

MK--Concerned about cluster activity, CMSSW installs, PhEDEx

CMSSW_2_1_5 is unstable, so removed. Installed CMSSW_2_1_7.
Ganglia shows unusual activity on WNs: short, frequent spikes, similar to what I would see on the PhEDEx node. Concerned by this behavior.
PhEDEx node didn't show expected level of activity. PhEDEx website showed links down in both Debug & Prod instances. RSV shows BeStMan is still up, so don't believe it is a problem with the SE. Nebraska has been having troubles, but FNAL link should be up, so restarted PhEDEx services.

MK--Began training MT as backup admin

Need someone who can fill admin role while MK is elsewhere. Began training on basic admin tasks.

MK--Checked CRAB jobs don't run on NFS mount & looked into log rotation

CRAB jobs are first sent to /home/uscms01 (NFS mount) by OSG. They grab the correct environment variables and move themselves to the correct WN local directory, /tmp. However, they unzip themselves first into /home/uscms01 and leave behind a log trail. There is a way to configure maradona-based jobs (CRAB jobs submitted via gLite) such that they will go to /tmp in the first place. However, since we expect most CRAB jobs will be scheduled using condor_g, this is not useful. Not a major performance impact, since most files will be made in /tmp. Also will not have a bad performance impact on the new CMSSW "lazy download" option, which will stage files in chunks to local disk from the SE - great for us, since our SE is network mounted.
Appears a number of services already rotate their logs. The big ones to take care of are the PhEDEx logs. Should use logrotate for individual files in .../logs. Can use find (maybe?) to safely remove any directories in .../state/.../archive w/o crashing PhEDEx.
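A rough sketch of the archive cleanup idea, written as a dry-run-by-default helper. Whether removing these directories is safe while PhEDEx agents are running is still the open question above; the age threshold and the name prune_old_dirs are assumptions, and the archive path is left to the caller.

```python
import os
import shutil
import time


def prune_old_dirs(archive_root, max_age_days, dry_run=True, now=None):
    """List (and optionally remove) subdirectories older than max_age_days.

    Only immediate subdirectories of archive_root are considered, and age
    is judged by the directory's own modification time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(os.scandir(archive_root), key=lambda e: e.name):
        # Skip plain files and symlinks; only real directories are candidates.
        if not entry.is_dir(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            if not dry_run:
                shutil.rmtree(entry.path)
            removed.append(entry.name)
    return removed
```

Running with dry_run=True first shows what would go before anything is deleted.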