January-March 2011 Log
New DIMMs for R510s, installed Rocks 5.4, kickstarted SE.
March 3, 2011
MT -- Updated OSG; Updated CRAB
- Updated OSG to 1.2.18. Updated CRAB to 2_7_7_patch1. Restarted Grid services; they appear to work fine.
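- For reference, the restart was roughly along these lines (a sketch, not the exact commands run; the OSG/VDT install path is site-specific):
      . /opt/osg/setup.sh        # assumption: OSG/VDT install location
      vdt-control --off          # stop the VDT-managed grid services
      vdt-control --on           # bring them back up
      condor_q                   # quick sanity checks
      condor_status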
February 13, 2011
MT & MK -- Installed OSG, updated xml files, brought cluster back up
- Set up external internet access on the nodes that needed it. Installed OSG. Updated Condor settings. Updated the xml files so that the hadoop directories are not formatted on kickstart (see the sketch below). Re-kickstarted most nodes. Started OMSA (HN, GN). Installed SQUID. Installed CRAB and gLite on the INs. Tested services; they appear to work.
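- The xml change to keep the hadoop data partitions from being reformatted is roughly of the following form (a sketch only, assuming the standard Rocks 5.x mechanism of writing kickstart "part" lines to /tmp/user_partition_info from a <pre> section; the device names, sizes, and mount points here are placeholders, not our real layout):
      <pre>
      echo "clearpart --all --drives=sda --initlabel
      part / --size=16000 --ondisk=sda
      part swap --size=4000 --ondisk=sda
      part /hadoop1 --noformat --onpart=sdb1" > /tmp/user_partition_info
      </pre>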
February 12, 2011
MT & MK -- Kickstarted nodes & configured Condor
- Kickstarted the WNs, R510s, SE node, IN nodes, and grid node. /data remained untouched and automounts correctly. Set up the automount of /sharesoft (CMSSW); a sketch is below. Restored cron (on the frontend). Worked on the Condor installation (separate from the previously used roll method) and its configuration. Compute-0-10 is not up.
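- A sketch of the /sharesoft automount wiring (the map file name, export path, and server name are placeholders; the real setup rides on the autofs maps Rocks pushes out via 411):
      # /etc/auto.master -- assumed entry pointing /sharesoft at a direct map
      /-   /etc/auto.sharesoft
      # /etc/auto.sharesoft -- direct map entry (placeholder export path)
      /sharesoft   -fstype=nfs,rw   frontend.local:/export/sharesoft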
February 11, 2011
MT & MK -- Brought cluster down for weekend outage: installed Rocks 5.4 on HN
- As part of the upgrade to Rocks 5.4, we installed Rocks 5.4 on the HN. We did not format the /home directory, so its contents remain. We added the user accounts and ran a number of yum updates on the HN.
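- Re-creating the accounts on the freshly installed HN amounts to roughly the following (a sketch; the username is a placeholder, and this assumes the stock Rocks user-sync mechanism):
      useradd someuser           # placeholder account name
      passwd someuser
      rocks sync users           # push the account info out to the nodes via 411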
February 4, 2011
MK -- Kickstarted new SE
- In preparation for a separate SE with Hadoop installed, kickstarted a new R410 node and brought it up as SE-0-2. As usual, the BMCs issued DHCP requests that were intercepted by insert-ethers during Kickstart, which is why the node doesn't have the number SE-0-0. Also, the additional NIC caused very strange behavior during PXE boot. Had to plug network cables into both port 1 on the integrated NIC and the first two ports (possibly just the first port) on the additional NIC. Dumb guess: PXE boot has to come off the integrated card selected in the BIOS, but the system then relabels the eth ports on the additional NIC starting with eth0, and Rocks only tries to talk to eth0, eth1, and eth2 during Kickstart. The additional NIC has four ports, so the suspicion is that Rocks was only talking to the additional NIC while the BIOS was trying to PXE boot off the integrated NIC. Regardless, it's not actually clear which of the eth cables plugged into the switch is the one Rocks has put the internal network on. /etc/modprobe.conf lists eth0-eth3 as type igb and eth4 and eth5 as type bnx2, supporting the idea that the additional NIC is labeled in Linux as eth0-eth3 while the integrated NIC (which has two ports) is labeled eth4 and eth5.
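- To pin down which physical port Rocks actually put the internal network on, something like the following should work (a sketch; the interface names follow the eth0-eth5 layout described above, and the hostname is as Rocks reports it):
      rocks list host interface se-0-2      # which eth device Rocks assigned to the private network
      grep eth /etc/modprobe.conf           # driver per interface (igb = add-on NIC, bnx2 = integrated)
      ethtool -p eth0 10                    # blink the port LED on eth0 for 10 seconds to find the physical port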
February 2, 2011
MK -- Shot all R510s, fixed condor slots on R510s
- Specified NUM_CPUS=24 in the condor configuration script on R510-0-3, restarted the condor service, and condor_status now shows only 24 slots on the node. Don't know why there were 36 slots, but this seems to fix it.
- Set config.ini cores_per_node to 24 (though this might not be the right value) and increased node_count to 5. Have not restarted the OSG services, but the BDII already shows the new subcluster, so a restart is probably not required.
- Modified R510.xml to add the NUM_CPUS line to the condor configuration, kickstarted all R510s.
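- For the record, the fix boils down to pinning the slot count in the Condor local configuration (a sketch; the config file path depends on how Condor is installed on the nodes):
      # e.g. in condor_config.local on each R510 (path is an assumption)
      NUM_CPUS = 24
      # then, on the node:
      service condor restart
      condor_status r510-0-3                # should now show 24 slots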
February 1, 2011
MK -- Modified config.ini for new R510
- Added [Subcluster umd-cms-ce-2] to the CE config.ini. Have not yet reconfigured OSG; waiting for a response on the appropriate value of cores_per_node when using hyperthreading.
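- The new block in config.ini looks roughly like this (a sketch; node_count and cores_per_node are the values settled on Feb 2 above, cores_per_node still pending the hyperthreading question, and the remaining attributes and values are placeholders):
      [Subcluster umd-cms-ce-2]
      name = umd-cms-ce-2
      node_count = 5
      cores_per_node = 24        ; pending the hyperthreading decision
      ram_mb = 49152             ; placeholder value
      cpu_model = Intel Xeon     ; placeholder value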
January 31, 2011
MK -- Installed new DIMMs, updated R510 BIOS, kickstarted R510-0-3
- Changed the R510.xml Kickstart to preserve the existing partition structure, in preparation for /store data on Hadoop that we don't want to lose after Kickstarting the R510 nodes (this worked). Also brought the R510 nodes into the condor pool. Only Kickstarted R510-0-3, since all the other nodes have logged recent RAM failures. Was able to bring R510-0-3 into the condor pool, but with 36 batch slots instead of 24 (???).
- Received 4 new DIMMs to replace the failing memory on the new R510 nodes. However, we have seen more failures on these nodes in the intervening months, so we were not able to replace every DIMM that issued errors. Status of errors ("after the firmware upgrade" below refers to the firmware update in Nov, not today's BIOS update):
- GLQ: Errors on A2, A3, B2 after the firmware upgrade and log clears; errors on B1 & A3 prior to the firmware upgrade and log clears.
  - Early Nov: errors on DIMM B1 & A3, swapped B1 & A3 (wasn't aware of errors on A3), new error on B1
  - Mid-Late Nov: upgraded firmware, cleared NVRAM, cleared BMC logs
  - Nov 29: critical error on B2
  - Dec 13: critical error on A3
  - Jan 28: non-critical error on A3
  - Jan 30: critical error on A2
  - Jan 31: upgraded BIOS to 1.5.4, replaced DIMM A3 with new DIMM
- 2MQ: Errors on A3, B1, B2 after the firmware upgrade and log clears; errors on B3 (currently swapped with A1), A2, B2 prior to the firmware upgrade and log clears.
  - Early Nov: errors on A2, B1, B3, swapped A1 and B3, new error on B2
  - Mid-Late Nov: upgraded firmware, cleared NVRAM, cleared BMC logs
  - Nov 29: critical error on B2
  - Dec 1: critical errors on A3 and B1
  - Dec 27: critical error on B1
  - Jan 27: critical error on B2
  - Jan 31: upgraded BIOS to 1.5.4, replaced DIMM B2 with new DIMM
- 3MQ: Errors on A3, B2, B3 after the firmware upgrade and log clears; errors on B2 (currently swapped with A2) prior to the firmware upgrade and log clears.
  - Early Nov: error on B2, swapped with A2; log issues on this node make it unclear whether further errors were issued
  - Mid-Late Nov: upgraded firmware, cleared NVRAM, cleared BMC logs
  - Nov 29: critical error on B2
  - Dec 1: critical error on B3
  - Dec 28: critical error on B3
  - Jan 27: critical error on A3
  - Jan 31: upgraded BIOS to 1.5.4, replaced DIMM B3 with new DIMM
- 1MQ: Errors on B1, B2 after the firmware upgrade and log clears; errors on B2 (currently swapped with A1), B3 prior to the firmware upgrade and log clears.
  - Early Nov: errors on B2 & B3, swapped B2 and A1; log issues on this node make it unclear whether further errors were issued
  - Mid-Late Nov: upgraded firmware, cleared NVRAM, cleared BMC logs
  - Nov 30: critical error on B1
  - Dec 1: critical error on B2
  - Jan 27: critical error on B1
  - Jan 31: upgraded BIOS to 1.5.4, replaced DIMM B1 with new DIMM
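- Per the Feb 13 entry above, OMSA was started on the HN and GN; if it is also available on the R510s, the per-DIMM status and hardware log behind the errors above can be checked from the OS (a sketch, assuming the stock Dell OpenManage command-line tools):
      omreport chassis memory                  # per-DIMM status/health
      omreport system esmlog                   # hardware (ESM) log where the DIMM errors show up
      omconfig system esmlog action=clear      # clear the log after a swap (use with care)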