Configuration

NOTE: This is an archived version of the site configuration. It is no longer valid for our current setup, but is kept for reference purposes. Please access the new configuration here. This configuration is the one used for the archived version of the admin how-to guide.

The UMD HEP T3 cluster is composed of one head node (HN) and eight worker nodes (WNs). After RAID and formatting, users have ~400 GB of disk space for their /home area and ~10 TB for large datasets (/data). The cluster is managed by Rocks and is designed to have full T3 capability, including a storage element. It is on the Open Science Grid (OSG) and affiliated with the CMS virtual organization (VO).

Last edited August 12, 2009


Node Roles

The OSG Site Planning guide played an important role in the design of our cluster. Our head node distributes the OS to the worker nodes via Rocks Kickstart files and also serves as the head of the OSG storage element (SE) and compute element (CE). Of the eight worker nodes, seven are designated as user-interactive nodes; the eighth is the PhEDEx node, an extension of the basic worker node (CRAB is installed on the interactive WNs, as described below). All eight WNs are Condor processing nodes.
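
A quick way to see this layout from the cluster itself is to ask Rocks and Condor directly; a minimal sketch, assuming the standard Rocks 5.x and Condor command-line tools are available on the HN:

# On the HN: list every node Rocks knows about, with its appliance type
# (Frontend, Compute, ...) and rack/rank position.
rocks list host

# Condor's view of the same machines: all eight WNs should appear as
# processing nodes in the pool.
condor_status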

Head node:

external name: hepcms-0.umd.edu
internal name: hepcms-0

  • Rocks head
  • Job submission point to WNs via Condor (interactive users) and Condor_g (grid users)
  • Grid storage element (SE) & computing element (CE) head
  • Services SE requests with BeStMan
  • Stores users' /home area, which is visible to the WNs
  • Controls big disk via DAS cable and PERC6/E controller
  • Ganglia monitor and web server
  • Hosts most software, usable by the WNs
  • Provides internal network gateway
  • Squid web proxy for Frontier (CMSSW conditions database)
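
The services in the list above can be spot-checked from the HN. The sketch below is a rough health check; the exact service names and the export list are assumptions about how things are set up here.

# Run as root on hepcms-0.
condor_q                 # local (Condor) and grid (Condor_g) job queues
condor_status            # the WN slots visible to the scheduler
exportfs -v              # NFS exports (expected: /export/home, /export/apps, /data, /scratch)
service gmond status     # Ganglia monitoring daemon
service httpd status     # web server hosting the Ganglia pages
squid -k check           # Squid (Frontier proxy) responds to a null signal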

Having one node fulfill all three important roles of Rocks HN, OSG CE, and OSG SE is not a scalable solution. We do this because splitting the roles is not practical on such a small cluster, where dedicating two additional nodes to these roles would cost 25% of our computing power. Future upgrades, should they occur, will split these roles across multiple nodes.

Seven interactive worker nodes:

external name: hepcms.umd.edu points to hepcms-1.umd.edu - hepcms-7.umd.edu
internal names: compute-0-0 to compute-0-6

  • Services CE (Condor_g) jobs sent via the HN
  • Services Condor jobs sent via any node
  • Job submission point to WNs via Condor (interactive users) - this capability is currently disabled, but will be implemented
  • Runs user interactive jobs
  • Stores CMSSW temporary output in /tmp
  • Installs gLite-UI & CRAB in /scratch
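
For interactive users, batch work on these nodes goes through Condor. Below is a minimal submit description as a sketch; the file name and job are hypothetical, and until WN submission is enabled it would be submitted from the HN with condor_submit.

# hello.sub -- a minimal, hypothetical Condor submit description.
# Relative paths land in the submit directory (the NFS-shared /home area);
# bulky job output should instead be written by the job to /tmp on the WN
# and copied out when the job finishes (see the Partitions section).
universe   = vanilla
executable = /bin/hostname
output     = hello_$(Cluster).$(Process).out
error      = hello_$(Cluster).$(Process).err
log        = hello_$(Cluster).log
queue 4

# Submit and monitor:
#   condor_submit hello.sub
#   condor_q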

One important note: gLite-UI is not compatible with the services provided by the Rocks HN, so CRAB, which is based on gLite-UI, cannot be installed on the HN. However, CRAB does support job submission through CrabServer, which does not require gLite-UI. It is unknown whether CrabServer without gLite-UI can submit jobs to European sites. Some implementations of PhEDEx also run on top of gLite-UI, but use a different site configuration file; in that case, CRAB with gLite-UI and PhEDEx with gLite-UI cannot run on the same node. Since our PhEDEx installation does not use gLite-UI, we can install gLite-UI as part of the normal WN Kickstart files, which the PhEDEx node inherits.
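
For reference, a CRAB submission from one of the interactive WNs looks roughly like the sketch below. The paths under /scratch and the CMSSW release area are placeholders; only the general pattern (gLite-UI environment, then CMSSW, then CRAB) is the point.

source /scratch/glite/etc/profile.d/grid-env.sh      # gLite-UI environment (placeholder path)
cd ~/CMSSW_2_2_X/src && eval `scramv1 runtime -sh`   # CMSSW runtime environment (placeholder release)
source /scratch/CRAB/crab.sh                         # CRAB environment (placeholder path)
crab -create -submit                                 # create and submit tasks from crab.cfg
crab -status                                         # follow their progress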

PhEDEx worker node:

external name: hepcms-8.umd.edu
internal name: phedex-node-0-7

The PhEDEx WN inherits all the functionality of the seven interactive WNs, with the additional feature:

  • PhEDEx installed and configured

There is no particular reason to put PhEDEx on more than one node. Some implementations of PhEDEx run atop gLite-UI, which is not compatible with the Rocks HN services, so PhEDEx must be installed on one of the WNs. PhEDEx tends to hammer the network resources of a node, which is why the PhEDEx node probably should not be used for user-interactive tasks. However, PhEDEx does not drastically consume any other resources local to the node, so there is no need to take the PhEDEx node out of the Condor compute pool either. Thus, designating one WN as the PhEDEx node, which inherits all the functionality of a normal WN, makes sense. Users are not prohibited from logging in and running jobs interactively, but hepcms.umd.edu will not direct users to this node.
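
The hepcms.umd.edu alias mentioned above can be inspected directly with the standard resolver tools:

# See which interactive WNs the login alias currently resolves to; the PhEDEx
# node (hepcms-8.umd.edu) should not appear among the answers.
host hepcms.umd.edu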

Hardware

HN: Dell PowerEdge 2950

  • Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
  • 8GB 667MHz RAM
  • PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (users' area and applications), ~420GB
  • PERC6/E : controls all 15 physical disks of PowerVault MD1000 (big disk), configured as RAID-6, ~10 TB

WNs: Dell PowerEdge 1950

  • Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
  • 16GB 667MHz RAM
  • 80GB primary disk
  • 250GB /tmp disk

PowerVault MD1000 (aka big disk)

  • DAS
  • 15 750GB 7.2K RPM SATA 3Gbps hard drives
  • Controlled by PERC6/E controller in HN

PowerConnect 6224

  • Managed switch
  • Stacking capable
  • 24 GbE ports

APS 2200 VA

  • 120 Volt UPS
  • Network controllable (currently not configured)
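
Most of the inventory above can be confirmed from a running node with standard tools; a quick sketch (run as root, output details vary with the OS release):

grep -c "model name" /proc/cpuinfo         # 8 cores per node (two quad-core E5440s)
grep "model name" /proc/cpuinfo | uniq     # CPU model and clock speed
free -g                                    # 8 GB RAM on the HN, 16 GB on the WNs
fdisk -l 2>/dev/null | grep "^Disk /dev"   # disks seen by the OS (PERC virtual disks on the HN)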

 

Partitions

Head node (partition sizes in MB):

/dev/sda   69374 MB, RAID-1 (67.75 GB), physical disks 0:0:0, 0:0:1 :
/ (root)    8189   /dev/sda1  ext3
swap        8189   /dev/sda2  swap
/var        4095   /dev/sda3  ext3
(/dev/sda4 is the extended partition, which contains /dev/sda5)
/scratch   48901   /dev/sda5  ext3

/dev/sdb  418168 MB, RAID-5 (408.38 GB), physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5 :
/export   418168   /dev/sdb1  ext3

/dev/sdc  9744877 MB, RAID-6 (8.9 TB), 15 physical disks (LVM logical volume):
/data     9744877   /dev/mapper/datastore-cmsdata0  xfs

  • /scratch is meant primarily for CMSSW and is network mounted on all nodes as /software (CMSSW lives in /software/cmssw). Other software can be installed here, but doing so reduces the space available to CMSSW; if you must (this is discouraged; use /export/apps if you need a network mount), install it under /software/other.
  • Squid is installed to /scratch/squid, which is not network mounted (Squid does not play well with network mounts or RAID-5). Squid can use up to 5 GB on /scratch. Squid is needed for contacting the Frontier CMS conditions database, which is part of CMSSW. With ~50 GB in /scratch, ~5 GB per CMSSW release, and 5 GB for Squid, /scratch can hold about 9 CMSSW releases. We may reduce the space allocated to Squid later; however, 5 GB is already only a quarter of the minimum recommended size.
  • /export contains the users' home area as well as much of the software (besides CMSSW)
  • We were not able to find explicit details on how /export is handled on subsequent Rocks upgrades, but we believe that /export is preserved between reinstalls.
  • Applications install in /export/apps on the HN and auto-network-mount as /share/apps on the WNs and the HN.
  • The users' home area is in /export/home and network mounts as /home on both the HN and WNs.
  • Because the /home/user and /share/apps sub-directories are auto-network-mounted, they may not all appear in the output of ls; they must be explicitly cd'ed into first (the directories are not mounted until they are accessed). ls /export/home on the HN will always show all of the users' home directories, and similarly for ls /export/apps.
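
The on-demand mounting in the last bullet is easy to see in practice; a short illustration (the username is a placeholder):

ls /home                  # may list only the home directories already mounted
cd /home/someuser         # 'someuser' is a placeholder; cd triggers the automount
mount | grep someuser     # the freshly mounted directory now appears
ls /export/home           # on the HN, this always lists every home directory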

Worker nodes (partition sizes in MB):

/dev/sda    76293 :
/ (root)     8192   /dev/sda1  ext3
swap         8192   /dev/sda2  swap
/var         4096   /dev/sda3  ext3
/scratch    55813   /dev/sda4  ext3

/dev/sdb   238418 :
/tmp       238418   /dev/sdb1  ext3

  • Some documentation suggests that all non-root partitions on WNs will be preserved over subsequent reinstalls, but don't rely on it.
  • /scratch is meant for locally installed software.
  • /tmp is meant for temporary grid job output. It is used explicitly by CRAB CMSSW jobs and can also serve as a temporary location for output from interactive or Condor batch jobs. Output must be transferred by the user out of /tmp as soon as the job completes, since this partition is regularly cleaned.
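
As a concrete pattern, the copy-out step might look like the sketch below; the file name and the /data directory layout are placeholders.

# Run on the WN where the job ran (or as the last step of the job script):
# move the output from the node-local /tmp to the network-mounted /data.
mkdir -p /data/users/$USER/myjob              # placeholder destination directory
cp /tmp/myjob_output.root /data/users/$USER/myjob/
rm /tmp/myjob_output.root                     # free the space in /tmp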

Big disk array:

The entire disk array is presented to the OS as a single drive. We use RAID-6 so that a single disk failure does not cause a significant performance loss and the data survives a dual disk failure. The array is managed as an LVM logical volume. It allows up to two additional arrays to be daisy-chained onto it; with LVM, we can install additional arrays and simply extend the logical volume over the newly available space. We use the XFS filesystem, which is designed to handle large volumes and has been documented to perform well with BeStMan. While we do not currently use BeStMan in a full storage resource manager (SRM) capacity, the ability to do so may become necessary as the volume grows. At present the array is managed by the OS and network mounted as /data on all nodes; this makes it much more accessible to users, but is not a scalable solution. After RAID-6 and formatting, the array provides roughly 9.5 TB.
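
Extending the logical volume over a second array would look roughly like the following. This is a sketch only: the device name /dev/sdd is an assumption for how a new PERC6/E virtual disk would appear, while the volume group (datastore) and logical volume (cmsdata0) names come from the partition listing above.

pvcreate /dev/sdd                              # turn the new virtual disk into an LVM physical volume
vgextend datastore /dev/sdd                    # add it to the existing volume group
lvextend -l +100%FREE /dev/datastore/cmsdata0  # grow the logical volume over the new space
xfs_growfs /data                               # grow XFS; works while /data stays mounted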

 

Network

For security purposes, port information is not listed here. It can be read (by the root user only) in the file ~root/network-ports.txt on the HN.

external IP : external hostname : internal IP : Rocks name
----------------------------------------------------------
128.8.164.11 : (switch)         : 10.255.255.254 : network-0-0
128.8.164.12 : HEPCMS-0.UMD.EDU : 10.1.1.1       : hepcms-0
128.8.164.13 : HEPCMS-1.UMD.EDU : 10.255.255.253 : compute-0-0
128.8.164.14 : HEPCMS-2.UMD.EDU : 10.255.255.252 : compute-0-1
128.8.164.15 : HEPCMS-3.UMD.EDU : 10.255.255.251 : compute-0-2
128.8.164.16 : HEPCMS-4.UMD.EDU : 10.255.255.250 : compute-0-3
128.8.164.17 : HEPCMS-5.UMD.EDU : 10.255.255.249 : compute-0-4
128.8.164.18 : HEPCMS-6.UMD.EDU : 10.255.255.248 : compute-0-5
128.8.164.19 : HEPCMS-7.UMD.EDU : 10.255.255.247 : compute-0-6
128.8.164.21 : unassigned
128.8.164.22 : unassigned

compute-0-7 is a special case. Initially it is configured as:

128.8.164.20 : HEPCMS-8.UMD.EDU : 10.255.255.246 : compute-0-7

It is later replaced by phedex-node-0-7:

128.8.164.20 : HEPCMS-8.UMD.EDU : 10.255.255.246 : phedex-node-0-7

internal network always on eth0
external network always on eth1

External Gateway: 128.8.164.1
Netmask for external internet: 255.255.255.0
Netmask for internal network (on HN): 255.0.0.0
DNS for external internet: 128.8.74.2, 128.8.76.2
DNS for internal network (on HN): 10.1.1.1

The command 'dbreport dhcpd' issued from the HN can provide much of this information, including MAC addresses.
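
For a per-node view of the addressing above, a few standard commands are enough (run on the node in question):

/sbin/ifconfig eth0       # internal interface (10.x.x.x address)
/sbin/ifconfig eth1       # external interface (128.8.164.x address)
/sbin/route -n            # routing table; default gateway 128.8.164.1
cat /etc/resolv.conf      # DNS servers in use on this node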