Configuration

NOTE: This is an archived version of the site configuration. It is no longer valid for our current setup, but is kept for reference purposes. Please access the new configuration here. This configuration is the one used for the archived version of the admin how-to guide.

The UMD HEP T3 cluster is composed of one head node (HN) and eight worker nodes (WNs). After RAID and formatting, users have ~400 GB of disk space for their /home area and ~10 TB for large datasets (/data). The cluster is managed by Rocks and is designed to have full T3 capability, including a storage element. It is on the Open Science Grid (OSG) and affiliated with the CMS virtual organization (VO).

Last edited August 12, 2009


Node Roles

The OSG Site Planning guide played an important role in the design of our cluster. Our head node distributes the OS to the worker nodes via Rocks Kickstart files and also serves as the head of the OSG storage element (SE) and compute element (CE). Of the eight worker nodes, seven are designated as user-interactive nodes; the eighth is the PhEDEx node, an extension of the basic worker node (CRAB is installed on the interactive WNs, as described below). All eight WNs are Condor processing nodes.
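
A quick way to see this layout from the cluster itself is to ask Rocks and Condor directly; a minimal sketch, assuming the standard Rocks 5.x and Condor command-line tools are available on the HN:

# On the HN: list every node Rocks knows about, with its appliance type
# (Frontend, Compute, ...) and rack/rank position.
rocks list host

# Condor's view of the same machines: all eight WNs should appear as
# processing nodes in the pool.
condor_status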

Head node:

external name: hepcms-0.umd.edu
internal name: hepcms-0

  • Rocks head
  • Job submission point to WNs via Condor (interactive users) and Condor_g (grid users)
  • Grid storage element (SE) & computing element (CE) head
  • Services SE requests with BeStMan
  • Stores users' /home area, which is visible to the WNs
  • Controls big disk via DAS cable and PERC6/E controller
  • Ganglia monitor and web server
  • Hosts most software, usable by the WNs
  • Provides internal network gateway
  • Squid web proxy for Frontier (CMSSW conditions database)
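
The services in the list above can be spot-checked from the HN. The sketch below is a rough health check; the exact service names and the export list are assumptions about how things are set up here.

# Run as root on hepcms-0.
condor_q                 # local (Condor) and grid (Condor_g) job queues
condor_status            # the WN slots visible to the scheduler
exportfs -v              # NFS exports (expected: /export/home, /export/apps, /data, /scratch)
service gmond status     # Ganglia monitoring daemon
service httpd status     # web server hosting the Ganglia pages
squid -k check           # Squid (Frontier proxy) responds to a null signal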

Having one node fulfill all three important roles of Rocks HN, OSG CE, and OSG SE is not a scalable solution. We do this because splitting the roles is not practical on such a small cluster, where dedicating two additional nodes to these roles would cost 25% of our computing power. Future upgrades, should they occur, will split these roles across multiple nodes.

Seven interactive worker nodes:

external name: hepcms.umd.edu points to hepcms-1.umd.edu - hepcms-7.umd.edu
internal names: compute-0-0 to compute-0-6

  • Services CE (Condor_g) jobs sent via the HN
  • Services Condor jobs sent via any node
  • Job submission point to WNs via Condor (interactive users) - this capability is currently disabled, but will be implemented
  • Runs user interactive jobs
  • Stores CMSSW temporary output in /tmp
  • Installs gLite-UI & CRAB in /scratch
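
For interactive users, batch work on these nodes goes through Condor. Below is a minimal submit description as a sketch; the file name and job are hypothetical, and until WN submission is enabled it would be submitted from the HN with condor_submit.

# hello.sub -- a minimal, hypothetical Condor submit description.
# Relative paths land in the submit directory (the NFS-shared /home area);
# bulky job output should instead be written by the job to /tmp on the WN
# and copied out when the job finishes (see the Partitions section).
universe   = vanilla
executable = /bin/hostname
output     = hello_$(Cluster).$(Process).out
error      = hello_$(Cluster).$(Process).err
log        = hello_$(Cluster).log
queue 4

# Submit and monitor:
#   condor_submit hello.sub
#   condor_q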

One important note: gLite-UI is not compatible with the services provided by the Rocks HN, so CRAB, which is based on gLite-UI, cannot be installed on the HN. However, CRAB does support job submission through CrabServer, which does not require gLite-UI. It is unknown whether CrabServer without gLite-UI can submit jobs to European sites. Some implementations of PhEDEx also run on top of gLite-UI, but use a different site configuration file; in that case, CRAB with gLite-UI and PhEDEx with gLite-UI cannot run on the same node. Since our PhEDEx installation does not use gLite-UI, we can install gLite-UI as part of the normal WN Kickstart files, which the PhEDEx node inherits.
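
For reference, a CRAB submission from one of the interactive WNs looks roughly like the sketch below. The paths under /scratch and the CMSSW release area are placeholders; only the general pattern (gLite-UI environment, then CMSSW, then CRAB) is the point.

source /scratch/glite/etc/profile.d/grid-env.sh      # gLite-UI environment (placeholder path)
cd ~/CMSSW_2_2_X/src && eval `scramv1 runtime -sh`   # CMSSW runtime environment (placeholder release)
source /scratch/CRAB/crab.sh                         # CRAB environment (placeholder path)
crab -create -submit                                 # create and submit tasks from crab.cfg
crab -status                                         # follow their progress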

PhEDEx worker node:

external name: hepcms-8.umd.edu
internal name: phedex-node-0-7

The PhEDEx WN inherits all the functionality of the seven interactive WNs, with the additional feature:

  • PhEDEx installed and configured

There is no particular reason to put PhEDEx on more than one node. Some implementations of PhEDEx run atop gLite-UI, which is not compatible with the Rocks HN services, so PhEDEx must be installed on one of the WNs. PhEDEx tends to hammer the network resources of a node, which is why the PhEDEx node probably should not be used for user-interactive tasks. However, PhEDEx does not drastically consume any other resources local to the node, so there is no need to take the PhEDEx node out of the Condor compute pool either. Thus, designating one WN as the PhEDEx node, which inherits all the functionality of a normal WN, makes sense. Users are not prohibited from logging in and running jobs interactively, but hepcms.umd.edu will not direct users to this node.
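
The hepcms.umd.edu alias mentioned above can be inspected directly with the standard resolver tools:

# See which interactive WNs the login alias currently resolves to; the PhEDEx
# node (hepcms-8.umd.edu) should not appear among the answers.
host hepcms.umd.edu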

Hardware

HN: Dell PowerEdge 2950

  • Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
  • 8GB 667MHz RAM
  • PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (users' area and applications), ~420GB
  • PERC6/E : controls all 15 physical disks of PowerVault MD1000 (big disk), configured as RAID-6, ~10 TB

WNs: Dell PowerEdge 1950

  • Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
  • 16GB 667MHz RAM
  • 80GB primary disk
  • 250GB /tmp disk

PowerVault MD1000 (aka big disk)

  • DAS
  • 15 750GB 7.2K RPM SATA 3Gbps hard drives
  • Controlled by PERC6/E controller in HN

PowerConnect 6224

  • Managed switch
  • Stacking capable
  • 24 GbE ports

APS 2200 VA

  • 120 Volt UPS
  • Network controllable (currently not configured)
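
Most of the inventory above can be confirmed from a running node with standard tools; a quick sketch (run as root, output details vary with the OS release):

grep -c "model name" /proc/cpuinfo         # 8 cores per node (two quad-core E5440s)
grep "model name" /proc/cpuinfo | uniq     # CPU model and clock speed
free -g                                    # 8 GB RAM on the HN, 16 GB on the WNs
fdisk -l 2>/dev/null | grep "^Disk /dev"   # disks seen by the OS (PERC virtual disks on the HN)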

 

Partitions

Head node (partition sizes in MB):

/dev/sda   69374 MB, RAID-1 (67.75 GB), physical disks 0:0:0, 0:0:1 :
/ (root)    8189   /dev/sda1  ext3
swap        8189   /dev/sda2  swap
/var        4095   /dev/sda3  ext3
(/dev/sda4 is the extended partition, which contains /dev/sda5)
/scratch   48901   /dev/sda5  ext3

/dev/sdb  418168 MB, RAID-5 (408.38 GB), physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5 :
/export   418168   /dev/sdb1  ext3

/dev/sdc  9744877 MB, RAID-6 (8.9 TB), 15 physical disks (LVM logical volume):
/data     9744877   /dev/mapper/datastore-cmsdata0  xfs

  • /scratch is meant primarily for CMSSW and is network mounted on all nodes as /software (CMSSW lives in /software/cmssw). Other software can be installed here, but doing so reduces the space available to CMSSW; if you must (this is discouraged; use /export/apps if you need a network mount), install it under /software/other.
  • Squid is installed to /scratch/squid, which is not network mounted (Squid does not play well with network mounts or RAID-5). Squid can use up to 5 GB on /scratch. Squid is needed for contacting the Frontier CMS conditions database, which is part of CMSSW. With ~50 GB in /scratch, ~5 GB per CMSSW release, and 5 GB for Squid, /scratch can hold about 9 CMSSW releases. We may reduce the space allocated to Squid later; however, 5 GB is already only a quarter of the minimum recommended size.
  • /export contains the users' home area as well as much of the software (besides CMSSW)
  • We were not able to find explicit details on how /export is handled on subsequent Rocks upgrades, but we believe that /export is preserved between reinstalls.
  • Applications install in /export/apps on the HN and auto-network-mount as /share/apps on the WNs and the HN.
  • The users' home area is in /export/home and network mounts as /home on both the HN and WNs.
  • Because the /home/user and /share/apps sub-directories are auto-network-mounted, they may not all appear in the output of ls; they must be explicitly cd'ed into first (the directories are not mounted until they are accessed). ls /export/home on the HN will always show all of the users' home directories, and similarly for ls /export/apps.
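
The on-demand mounting in the last bullet is easy to see in practice; a short illustration (the username is a placeholder):

ls /home                  # may list only the home directories already mounted
cd /home/someuser         # 'someuser' is a placeholder; cd triggers the automount
mount | grep someuser     # the freshly mounted directory now appears
ls /export/home           # on the HN, this always lists every home directory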

Worker nodes (partition sizes in MB):

/dev/sda    76293 :
/ (root)     8192   /dev/sda1  ext3
swap         8192   /dev/sda2  swap
/var         4096   /dev/sda3  ext3
/scratch    55813   /dev/sda4  ext3

/dev/sdb   238418 :
/tmp       238418   /dev/sdb1  ext3

  • Some documentation suggests that all non-root partitions on WNs will be preserved over subsequent reinstalls, but don't rely on it.
  • /scratch is meant for locally installed software.
  • /tmp is meant for temporary grid job output. It is used explicitly by CRAB CMSSW jobs and can also serve as a temporary location for output from interactive or Condor batch jobs. Output must be transferred by the user out of /tmp as soon as the job completes, since this partition is regularly cleaned.
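
As a concrete pattern, the copy-out step might look like the sketch below; the file name and the /data directory layout are placeholders.

# Run on the WN where the job ran (or as the last step of the job script):
# move the output from the node-local /tmp to the network-mounted /data.
mkdir -p /data/users/$USER/myjob              # placeholder destination directory
cp /tmp/myjob_output.root /data/users/$USER/myjob/
rm /tmp/myjob_output.root                     # free the space in /tmp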

Big disk array:

The entire disk array is presented to the OS as a single drive. We use RAID-6 so that a single disk failure does not cause a significant performance loss and the data survives a dual disk failure. The array is managed as an LVM logical volume. It allows up to two additional arrays to be daisy-chained onto it; with LVM, we can install additional arrays and simply extend the logical volume over the newly available space. We use the XFS filesystem, which is designed to handle large volumes and has been documented to perform well with BeStMan. While we do not currently use BeStMan in a full storage resource manager (SRM) capacity, the ability to do so may become necessary as the volume grows. At present the array is managed by the OS and network mounted as /data on all nodes; this makes it much more accessible to users, but is not a scalable solution. After RAID-6 and formatting, the array provides roughly 9.5 TB.
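
Extending the logical volume over a second array would look roughly like the following. This is a sketch only: the device name /dev/sdd is an assumption for how a new PERC6/E virtual disk would appear, while the volume group (datastore) and logical volume (cmsdata0) names come from the partition listing above.

pvcreate /dev/sdd                              # turn the new virtual disk into an LVM physical volume
vgextend datastore /dev/sdd                    # add it to the existing volume group
lvextend -l +100%FREE /dev/datastore/cmsdata0  # grow the logical volume over the new space
xfs_growfs /data                               # grow XFS; works while /data stays mounted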

 

Network

For security purposes, port information is not listed here. It can be read (by the root user only) in the file ~root/network-ports.txt on the HN.

external IP : external hostname : internal IP : Rocks name
----------------------------------------------------------
128.8.164.11 : (switch)         : 10.255.255.254 : network-0-0
128.8.164.12 : HEPCMS-0.UMD.EDU : 10.1.1.1       : hepcms-0
128.8.164.13 : HEPCMS-1.UMD.EDU : 10.255.255.253 : compute-0-0
128.8.164.14 : HEPCMS-2.UMD.EDU : 10.255.255.252 : compute-0-1
128.8.164.15 : HEPCMS-3.UMD.EDU : 10.255.255.251 : compute-0-2
128.8.164.16 : HEPCMS-4.UMD.EDU : 10.255.255.250 : compute-0-3
128.8.164.17 : HEPCMS-5.UMD.EDU : 10.255.255.249 : compute-0-4
128.8.164.18 : HEPCMS-6.UMD.EDU : 10.255.255.248 : compute-0-5
128.8.164.19 : HEPCMS-7.UMD.EDU : 10.255.255.247 : compute-0-6
128.8.164.21 : unassigned
128.8.164.22 : unassigned

compute-0-7 is a special case. Initially it is configured as:

128.8.164.20 : HEPCMS-8.UMD.EDU : 10.255.255.246 : compute-0-7

It is later replaced by phedex-node-0-7:

128.8.164.20 : HEPCMS-8.UMD.EDU : 10.255.255.246 : phedex-node-0-7

internal network always on eth0
external network always on eth1

External Gateway: 128.8.164.1
Netmask for external internet: 255.255.255.0
Netmask for internal network (on HN): 255.0.0.0
DNS for external internet: 128.8.74.2, 128.8.76.2
DNS for internal network (on HN): 10.1.1.1

The command 'dbreport dhcpd' issued from the HN can provide much of this information, including MAC addresses.
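
For a per-node view of the addressing above, a few standard commands are enough (run on the node in question):

/sbin/ifconfig eth0       # internal interface (10.x.x.x address)
/sbin/ifconfig eth1       # external interface (128.8.164.x address)
/sbin/route -n            # routing table; default gateway 128.8.164.1
cat /etc/resolv.conf      # DNS servers in use on this node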