January 2009 log
Experimented with various Condor configuration settings. Ultimately opted for PREEMPTION_REQUIREMENTS = False, so that one Condor job will not kick another that is already running off a node. All other attempts at changing the Condor configuration left nodes swapping heavily. Also got all WNs to report their status via calls to ipmish; the HN does not yet respond to ipmish. Still need an automated system that monitors all nodes and sends a text message in the event of a failure.
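Rough sketch of what that watchdog could look like, assuming each WN's BMC answers at <node>-ipmi, that ipmish takes the -ip/-u/-p flags and a "power status" query as in Dell's BMU documentation, and that the text message goes out through a carrier's email-to-SMS gateway (the node names, addresses, and password below are placeholders, not the eventual setup):

  #!/bin/sh
  # Hypothetical watchdog: poll each WN's BMC via ipmish and mail an alert
  # to an email-to-SMS gateway if a node stops answering.
  NODES="compute-0-0 compute-0-1 compute-0-2"     # placeholder node list
  ALERT="5555551234@sms-gateway.example.com"      # placeholder SMS address
  for n in $NODES; do
      # "power status" is the assumed ipmish query; adjust to the command
      # set documented for the installed BMU version.
      if ! ipmish -ip ${n}-ipmi -u root -p XXXXXX power status >/dev/null 2>&1; then
          echo "IPMI check failed for $n at $(date)" | mail -s "WN alert: $n" "$ALERT"
      fi
  done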
January 27, 2009
MK -- Changed condor config settings
- Nodes started swapping heavily last night, so I removed my jobs and backed out all changes to condor_config.local on all nodes. Decided to try just PREEMPT=False, because from the Condor user guide it seemed that preemption exists only to kick one job out in favor of another, which is exactly the behavior I most want to turn off. However, the swapping problems came back, so I read up further on preemption and finally found a clear guideline in this condor FAQ. Set PREEMPTION_REQUIREMENTS = False, so jobs can still be preempted when the startd's own policy calls for it, but never simply because a higher-priority user has idle jobs (see the sketch after this entry).
- Restarted BeStMan because the OSG probes were returning unknown status. Now passing srm-ping-probe; srmcp-srm-probe is still in unknown status, possibly due to the issue Eric mentioned about the newer srmcp requiring equal signs in all option arguments.
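For my own reference, the preemption change itself, roughly (the setting belongs on the central manager / HN per the Condor manual; treat this as a sketch rather than a dump of the live condor_config.local):

  # HN condor_config.local: never preempt a running job just because a
  # higher-priority user has idle jobs; startd-level PREEMPT/RANK still apply.
  PREEMPTION_REQUIREMENTS = False

  # Then push the change out without killing running jobs:
  #   condor_reconfig -all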
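To poke at BeStMan by hand instead of waiting on the probes, something like the following should work, assuming the BeStMan client tools are installed and the SRM endpoint is on the HN at port 8443 (hostname, paths, and stream count below are placeholders):

  # What srm-ping-probe exercises: ping the SRM v2.2 endpoint
  srm-ping srm://hn.example.edu:8443/srm/v2/server

  # What srmcp-srm-probe exercises: a small copy, written with "=" between
  # each option and its value, as the newer srmcp reportedly requires
  srmcp -2 -streams_num=1 file:////tmp/srmtest.txt \
      "srm://hn.example.edu:8443/srm/v2/server?SFN=/data/srmtest.txt"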
January 26, 2009
MK -- Restarted condor on all nodes to pick up new config settings, updated VDT to 1.10.1s
- Experiencing lag on the WNs due to HN network saturation. Unsure whether this is caused by the new Condor configuration settings or by the disk-intensive jobs I submitted; verifying properly would mean rolling back to the old Condor settings and resubmitting the disk-intensive jobs. Given the nature of those jobs, I suspect them rather than the Condor settings.
- Updated VDT for both the CE and the WN client according to these directions.
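For future updates, the general shape of the procedure (the actual pacman/cache step comes from the linked directions and is omitted here, so this is an outline, not the exact commands run):

  cd $VDT_LOCATION
  . setup.sh
  vdt-control --off      # stop VDT-managed services before updating
  # ... update step per the linked VDT 1.10.1 directions goes here ...
  vdt-control --on       # bring services back up
  vdt-version            # confirm the CE and WN client now report 1.10.1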
January 23, 2009
MK -- Modified /opt/condor/etc/condor_config.local
- Condor jobs showed an undesirable level of churn. Realized this was because the HN was running with SUSPEND=False while the WNs were still using the default SUSPEND criteria; the net effect was that grid jobs were dominating non-grid jobs. Revisited the churn issue and found this particular section in the Condor manual:
NOTE: If you have machines with lots of real memory and swap space such that the only scarce resource is CPU time, consider defining JOB_RENICE_INCREMENT so that Condor starts jobs on the machine with low priority. Then, further configure to set up the machines with:
  START = True
  SUSPEND = False
  PREEMPT = False
  KILL = False
In this way, Condor jobs always run and can never be kicked off. However, because they would run with "nice priority", interactive response on the machines will not suffer. You probably would not notice Condor was running the jobs, assuming you had enough free memory for the Condor jobs that there was little swapping.
Checking Ganglia showed that even the heaviest-use nodes had only about 10GB of memory in use, with the remaining 6GB mostly going to cache on some of them. This confirms that memory is not a bottleneck for CMSSW jobs, so this configuration is a good fit. Will need to reevaluate when Marguerite runs some Heavy Ion production, which may stress memory.
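Concretely, the per-node piece of this looks roughly like the block below (a sketch of condor_config.local on the WNs; the renice value of 10 is an arbitrary example, not necessarily what will be deployed):

  # Run Condor jobs niced so interactive and non-grid work keep priority...
  JOB_RENICE_INCREMENT = 10

  # ...and never suspend, preempt, or kill them once they start.
  START   = True
  SUSPEND = False
  PREEMPT = False
  KILL    = False

  # Push out with condor_reconfig -all from the HN.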
January 7, 2009
MK -- Updated OSG CEMon-BDII collector, removed CMSSW_2_1_0 and CMSSW_2_1_10
- Now reporting to two collectors, as detailed here.
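A quick way to double-check that the site really shows up in the new collector is an ldapsearch against the BDII; the host, port, and base DN below are placeholders for whatever the directions specify (2170 is just the conventional BDII port):

  ldapsearch -x -LLL -h bdii.example.org -p 2170 \
      -b "mds-vo-name=local,o=grid" "(GlueSiteName=*)" GlueSiteName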
January 6, 2009
MK -- Installed Dell Remote Access Controller and Baseboard Management Controller software from OpenManage CD
- Changes to the BIOS and remote access controller settings are needed before the software can read temperatures and send alert emails. Tentatively scheduled HN downtime for Saturday the 17th.
- Attempted to create a Rocks restore roll, but it crashed with a segfault. An earlier attempt had run out of space (the build puts its data in /var/tmp), so I symlinked the relevant directory onto /data. The next run did not run out of space but still segfaulted. Will just back up the critical files, test first on a WN, and keep my fingers crossed when I touch the HN.
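The space workaround was just a relocate-and-symlink of the build area onto /data, roughly as below; the paths are illustrative and the final make step is the Rocks 5 restore-roll recipe from memory, so verify against the Rocks documentation before trusting it:

  # Move the build area to the large /data partition, leave a symlink behind
  mv /var/tmp/restore-build /data/restore-build
  ln -s /data/restore-build /var/tmp/restore-build

  # Rebuild the restore roll (from memory; check the Rocks docs)
  cd /export/site-roll/rocks/src/roll/restore && make roll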
January 5, 2009
MK -- Installed CMSSW_2_2_1 and CMSSW_2_2_3