Log

2013:

The log has been moved offsite and is available to sysadmins only.

2012:

April-May 2012
- Major outage in May for the upgrade to OSG 3.0, CMS grid software changes, and other updates. One disk in the RAID-5 /data array replaced; Hadoop upgraded. In progress ...
January-March 2012
- Debugged Hadoop problems: R510 nodes lost FUSE contact with /hadoop; fixed with TCP tuning and channel bonding. Replaced the motherboard and 2 failed data disks in R510-0-5 (now 17). Replaced SE-0-1 memory after a DIMM_A3 error. Cleaned the old /data/se/store (it had been replicated to /hadoop/store). Updated R510 firmware to resolve a spurious HV error in omreport system.

2011:

October-December 2011
- Installed gridftp-hdfs to implement Hadoop as the SE for grid jobs. Debugged randomly failing gridftp-simple RSV probes.
April-September 2011
- Outage August 5-7 for software and firmware upgrades. Created more user groups. Debugged R510 nodes not coming online after a power outage. Upgraded CRAB. Debugged SE-0-1 memory problems. Installed Hadoop. Installed GUMS. Outage September 21-22 to install two new R510 nodes and replace PDU/strip hardware.
January-March 2011
- Rocks 5.4 upgrade; condor configured separately from rocks (includes new nodes); OSG upgrade.

2010:

October-November 2010
- Racked new equipment.
March-September 2010
- New site certs, enabled public_html, modified NFS /data mount settings, updated software.
January 2010
- Mapped affiliated users to special grid accounts, created special low priority user.

2009:

December 2009
- Modified garbage collection jobs, upgraded to SL5.
November 2009
- Kernel updates, updated to OSG 1.2.3, TCP tuning on the grid node for PhEDEx.
October 2009
- Updated software. Discovered that OMSA regularly dies on nodes, so it needs to be restarted regularly. Hopefully OMSA for RHEL5 won't have this problem.
September 2009
- Kernel updates, debugged PhEDEx hanging, installed new CRAB releases.
August 2009
- Reconfigured cluster to add grid and interactive nodes. Installed OSG 1.2. Now installing CMSSW via automatic jobs from Bockjoo. Installed PhEDEx 3.2.0 and switched to using srm-copy instead of srmcp.
July 2009
- Fixed the Condor configuration so that jobs can be suspended but not evicted for performance or priority reasons. Installed SuperB FastSim software temporarily, pending a decision from Nick on whether to allow them permanent installation access.
June 2009
- More failure alerts issued by OpenManage, but problems didn't seem to show up anywhere.
May 2009
- Installed OpenGL on all nodes, updated RAID firmware and drivers, kickstarted all WNs.
April 2009
- Updated gLite-UI, fixed gLite submission problem to UMD, installed graphics tools on phedex node.
March 2009
- Primarily CMSSW release installs and removals. Some attempts to debug outstanding issue with CRAB submission via gLite scheduler to UMD.
February 2009
- Instrumented cluster with Dell OpenManage. Updated Squid, PhEDEx, & VDT. Handled problems caused by PhEDEx dominating the SE services.
January 2009
- Experimented with various Condor configuration settings. Ultimately opted for PREEMPTION_REQUIREMENTS = False so that one Condor job will not boot another that is already running. All other attempts to change Condor configuration settings resulted in nodes swapping memory. Also got all WNs to provide their status via calls to ipmish. The HN does not yet respond to ipmish. Still need an automated system that monitors all nodes and sends a text message in the event of failure.

2008:

December 2008
- Performed PhEDEx, Squid, and VDT upgrades. Updated network configuration (subnet mask changed).
November 2008
- Primarily debugged PhEDEx and SRM transfers; learned that third-party push mode is generally unreliable and that a couple of failed transfers are not cause for alarm. Also updated OSG and debugged CRAB private DBS registration.
October 2008
- Commissioned PhEDEx links to UCSD & Purdue, practiced power down and up procedure with Marguerite, added security (check /root/security.txt for details) & upgraded to OSG 1.0.
September 2008
- Primarily small upgrades and maintenance, as the cluster has reached full operational status. Installed a number of CMSSW_2_1_X releases.
August 2008
- Got the cluster fully operational and able to provide all desired services. Made all services available from the interactive WNs, so users only have to log in to the HN to change their password. Successfully tested CRAB job submission to the site. Dealt with a power loss at the RDC.
July 2008
- Fixed a number of small but critical issues. Configured PhEDEx to use the storage element properly, so we show up as hosting data in DBS. Commissioned PhEDEx links to FNAL & Nebraska. Put the website into a presentable form and gave a presentation about the work at the All USCMS meeting. Worked on the gLite-UI installation, but did not complete it.
June 2008
- Advanced Rocks configuration, grid installation and configuration.
May 2008
- Got familiar with Rocks basics and installed non-grid software.
April 2008
- Got the cluster operational. We performed basic tasks such as OS installation, partitioning, and networking.

UMD HEP T3 Computing Cluster
