Configuration
This guide is 99% out of date as of 2015 and will be removed and replaced shortly. Admins of hepcms: please consult our private Google pages for documentation.
The UMD HEP T3 cluster is composed of one head node (HN), one grid node (GN), one storage element node (SE), two interactive nodes (INs), and fifteen worker nodes (WNs). After RAID and formatting, we have ~9TB of disk space for interactive use, ~400GB for network-mounted software such as CMSSW, and ~400GB for users' network-mounted /home. With Hadoop, we have ~86TB of space for storage element (SE) hosted datasets. The cluster is managed by Rocks and is designed to have full T3 capability, including a storage element. It is on the Open Science Grid (OSG) and affiliated with the CMS virtual organization (VO).
Last edited September 10, 2015
Table of Contents
- Node Roles
- Hardware
- Some Partition Information
- Network
Node Roles
The OSG Site Planning guide played an important role in the design of our cluster. Our head node (HN) distributes the OS and basic configuration to all other nodes via Rocks kickstart files, and runs the Squid web proxy for accessing CMSSW's Frontier database. The grid node (GN) runs the OSG computing element (CE), storage element (SE), PhEDEx, and CMSSW. Users log in to and run interactive jobs on the two interactive nodes (INs), which have locally installed gLite-UI & CRAB software. The fifteen worker nodes (WNs) are members of the Condor pool and service batch jobs submitted either by local users or by grid users within our supported VOs (primarily CMS). The whole cluster fits in a single rack.
Head node:
external name: hepcms-hn.umd.edu
internal name: HEPCMS-0 (for historical reasons)
- Rocks head
- Condor pool manager
- Stores users' /home area, which is network mounted
- Ganglia monitor and web server
- Provides internal network gateway
- Squid web proxy for Frontier (CMSSW conditions database)
Grid node:
external name: hepcms-0.umd.edu
internal name: grid-0-0
- Job submission point to WNs for condor grid jobs
- Grid storage element (SE) & computing element (CE)
- Services SE requests with BeStMan-Gateway
- Hosts network-mounted OSG worker node client
- Controls big disk via DAS cable and PERC6/E controller
- Hosts network-mounted CMSSW
- Runs PhEDEx
Having one node fulfill the four important roles of CE, SE, PhEDEx service, and CMSSW network mount is not a scalable solution. We do this because splitting the roles is not practical on such a small cluster.
Some implementations of PhEDEx run atop gLite-UI, which may cause problems with the Rocks frontend, OSG CE, or SE. Additionally, some CRAB installations (such as ours) run atop gLite-UI, which may need to be configured differently for CRAB than for PhEDEx. Our PhEDEx installation uses simple srm commands instead of the specialized File Transfer Service (FTS), which requires gLite-UI. A PhEDEx installation that uses gLite-UI should not be on the OSG CE or SE, on a Rocks frontend, or on a node with gLite-UI configured for CRAB.
Storage Element Node:
internal name: SE-0-1
- Primary NameNode for Hadoop distributed disk storage
Two interactive nodes:
external names: hepcms.umd.edu points to hepcms-in1.umd.edu & hepcms-in2.umd.edu
internal names: interactive-0-0 & interactive-0-1
- Job submission point to WNs via Condor (interactive users)
- Installs gLite-UI & CRAB in /scratch
- Runs user interactive jobs
- Secondary NameNode for Hadoop
One important note: gLite-UI does not behave well on a Rocks frontend (some tarball installations of gLite-UI seem better behaved). So our CRAB, based on gLite-UI, cannot be installed on the HN, nor on the GN because of similar problems with the OSG CE & SE. However, CRAB does support job submission to European sites using Condor GlideIn to some CrabServers, which does not require gLite-UI.
Fifteen worker nodes:
Not externally accessible
internal names (some numbers missing): compute-0-1 -> compute-0-14, R510-0-1 -> R510-0-9
- Service CE (Condor) jobs sent via the GN
- Service interactive (Condor) jobs sent via any of the INs
- Stores CMSSW temporary output in /tmp
- Uses the network-mounted OSG WN client for binaries and configuration needed by grid jobs
- Part of the disk pool for Hadoop, hosted by the SE
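Batch work reaches the WNs through Condor, submitted either from the GN (grid jobs) or an IN (local users). A minimal Condor submit description file looks like the following sketch; the script name `myjob.sh` and file names are hypothetical:

```
universe                = vanilla
executable              = myjob.sh
output                  = job.$(Cluster).out
error                   = job.$(Cluster).err
log                     = job.$(Cluster).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

From an IN, `condor_submit myjob.sub` queues the job and `condor_q` shows its status in the pool.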
Hardware
HN: Dell PowerEdge 2950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 8GB 667MHz RAM
- PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (users' area and applications), ~420GB
- PERC6/E : currently unused
GN: Dell PowerEdge 2950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 8GB 667MHz RAM
- PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (CMSSW & OSG software network mounts), ~420GB
- PERC6/E : controls all 15 physical disks of PowerVault MD1000 (big disk), configured as RAID-6, ~9 TB
SE: Dell PowerEdge R410
- Two 6-core Xeon X5650 Processors 12MB Cache, 2.66GHz, 1333MHz FSB
- 24GB 1333MHz RAM
- 20GB /tmp disk
INs: Dell PowerEdge 1950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.66GHz, 1333MHz FSB
- 16GB 667MHz RAM
- 146GB primary disk
- 146GB /tmp disk
WNs
compute nodes: Dell PowerEdge 1950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 16GB 667MHz RAM
- 80GB primary disk
- 250GB /tmp disk
R510 compute nodes: Dell PowerEdge R510
- Two 6-core Xeon X5650 Processors 12MB Cache, 2.66GHz, 1333MHz FSB
- hyperthreaded to provide 24 logical cores per node
- 48GB 1333MHz RAM
- 146GB primary disk
- 177GB /tmp disk
PowerVault MD1000 (aka big disk)
- DAS
- 15 750GB 7.2K RPM SATA 3Gbps hard drives
- Controlled by the PERC6/E controller in the GN
PowerConnect 6248
- Managed switch
- Stacking capable
- 48 GbE ports
APC 2200 VA
- 120 Volt UPS
- Network controllable (currently not configured)
- Powers the two 2950s (HN, GN), the PowerVault, the two 1950 INs, one R510, and the switch
Two PowerEdge 2160AS KVM switches
- 16 KVM ports via CAT5 cables (requires Dell server interface pod)
- 2 physical KVM control ports, one connected to rack KVM
Some Partition Information
Head node:
- Squid is installed in /scratch/squid, which is not network mounted (Squid doesn't like network mounts or RAID-5). Squid is needed for contacting the Frontier CMS conditions database, which is a part of CMSSW.
- /export contains the users' network mounted /home area as well as the network mounted /share/apps area (Rocks default). /home/install is also used by Rocks as the OS & kickstart distribution point.
- We were not able to find explicit details on how /export is handled on subsequent Rocks upgrades, but we believe that /export is preserved between reinstalls.
- Because the /home/user and /share/apps sub-directories are auto-mounted over the network, they may not all be visible to an ls command; a directory is mounted only when it is first accessed (e.g., by cd'ing into it). ls /export/home from the HN will always show all users' home directories, and similarly for ls /export/apps.
Grid node:
- /scratch is network mounted as /sharesoft
- OSG is installed in /scratch/osg (this location will change when we switch to OSG RPM installations in the future)
- CMSSW is installed in /scratch/cmssw
- PhEDEx is installed in /localsoft/phedex
- /scratch and /localsoft are preserved across Rocks kickstarts, but will be formatted when the partition table changes
Interactive nodes:
- CRAB is installed in /scratch/crab
- gLite-UI is installed in /scratch/gLite
- /scratch is preserved across Rocks kickstarts, but will be formatted when the partition table changes
Worker nodes:
- /scratch is meant for locally installed WN software, currently unused
- /tmp is meant for temporary grid job output. It is used explicitly by CRAB CMSSW jobs and can be used as a temporary location for job output for interactive jobs or condor batch jobs. Output must be transferred by the user out of /tmp as soon as the job completes as this partition is regularly cleaned.
Big disk array:
The entire disk array is treated as a single drive by the OS. We use RAID-6 so that a single disk failure does not cause a significant performance loss, and so our data survives a dual disk failure. The disk is treated as a logical volume in the OS. Our disk array allows up to two additional arrays to be attached in a daisy-chain; by using LVM, we can install additional arrays and simply extend the logical volume over the new space. We use the XFS filesystem, which is designed to handle large disk volumes. At present the array is managed by the OS and network-mounted as /data on all nodes. This makes the array much more accessible to users, but it is not a scalable solution. After RAID-6 and formatting, the array is roughly 9TB in size.
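The ~9TB figure follows from simple arithmetic. A quick sketch, assuming the 15 x 750GB drives listed under Hardware and RAID-6's two-disk parity overhead (filesystem overhead ignored):

```python
# Rough usable-capacity arithmetic for the MD1000 array (a sketch; assumes
# the 15 x 750 GB drives listed under Hardware and RAID-6's two-disk
# parity overhead, ignoring filesystem and LVM metadata overhead).
DISKS = 15
DISK_GB = 750        # vendor gigabytes, 10**9 bytes each
PARITY = 2           # RAID-6 dedicates the equivalent of two disks to parity

usable_gb = (DISKS - PARITY) * DISK_GB        # capacity left for data
usable_tib = usable_gb * 10**9 / 2**40        # in binary TiB, as df -h reports
print(f"{usable_gb} GB usable, about {usable_tib:.1f} TiB before formatting")
```

This gives 9750 vendor GB, i.e. just under 9 TiB as reported by df, consistent with the "roughly 9TB" above.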
Hadoop:
The Hadoop data nodes are the worker nodes. Hadoop files are replicated across nodes, so the system is self-correcting if one node is down; if two nodes are down, there may be problems. Hadoop is intended primarily for read/write through the OSG Storage Element (SE). Currently the Hadoop volume is 86TB in size; because Hadoop uses replication, df -h shows 173TB. It is best to keep the Hadoop volume below 86% usage (74TB and below) in case of failure; this level allows one R510 node to be down while the system remains healthy.
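The capacity numbers above are related by the replication factor; a sketch, assuming 2x block replication and the ~173TB raw figure reported by df -h:

```python
# Hadoop capacity arithmetic (a sketch; assumes 2x block replication and
# the ~173 TB of raw datanode disk reported by df -h across the cluster).
RAW_TB = 173          # raw disk across all datanodes, as df -h reports
REPLICATION = 2       # each block is stored on two different nodes

usable_tb = RAW_TB / REPLICATION       # unique data the volume can hold
safe_tb = 0.86 * usable_tb             # the 86% threshold quoted above
print(f"usable: {usable_tb:.1f} TB, keep usage below {safe_tb:.0f} TB")
```

The threshold works out to roughly 74TB of unique data, matching the guideline above.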
Network
For security purposes, port information is not listed here. It can be read (by the root user only) in the file ~root/network-ports.txt on the HN.
external IP  : external hostname  : internal IP  : Rocks name
--------------------------------------------------------------
N/A          : N/A (switch)       : 10.1.255.254 : network-0-0
128.8.164.11 : hepcms-hn.umd.edu  : 10.1.1.1     : hepcms-hn
128.8.164.12 : hepcms-0.umd.edu   : 10.1.255.253 : grid-0-0
N/A          : N/A                : 10.1.255.238 : SE-0-1
N/A          : N/A                : 10.1.255.251 : compute-0-1
N/A          : N/A                : 10.1.255.248 : compute-0-4
N/A          : N/A                : 10.1.255.247 : compute-0-5
N/A          : N/A                : 10.1.255.246 : compute-0-6
N/A          : N/A                : 10.1.255.245 : compute-0-7
N/A          : N/A                : 10.1.255.249 : compute-0-9 -> DOWN
N/A          : N/A                : 10.1.255.236 : compute-0-11
N/A          : N/A                : 10.1.255.235 : compute-0-14
128.8.164.21 : hepcms-in1.umd.edu : 10.1.255.239 : interactive-0-0
128.8.164.22 : hepcms-in2.umd.edu : 10.1.255.237 : interactive-0-1
N/A          : N/A                : 10.1.255.243 : R510-0-1
N/A          : N/A                : 10.1.255.244 : R510-0-2
N/A          : N/A                : 10.1.255.241 : R510-0-4
N/A          : N/A                : 10.1.255.240 : R510-0-6
N/A          : N/A                : 10.1.255.250 : R510-0-8
N/A          : N/A                : 10.1.255.252 : R510-0-9
N/A          : N/A                : 10.1.255.228 : R510-0-17
- Internal network is always on eth0
- External network is always on eth1
- Except on the R510 nodes, which are channel bonded to share eth0 and eth1 on the internal network
External Gateway: 128.8.164.1
Netmask for external internet: 255.255.255.0
Netmask for internal network (on HN): 255.0.0.0
DNS for external internet: 128.8.74.2, 128.8.76.2
DNS for internal network (on HN): 10.1.1.1
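On a RHEL/CentOS-style system such as a Rocks node, channel bonding of the kind used on the R510s is typically configured with ifcfg files like the following. This is a generic sketch, not our actual configuration; the IP address and bonding options shown are assumptions:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch; IP and options assumed)
DEVICE=bond0
IPADDR=10.1.255.243
NETMASK=255.0.0.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=balance-alb miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Both physical interfaces are enslaved to bond0, which carries the node's single internal-network IP.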
The command 'dbreport dhcpd' issued from the HN can provide much of this information, including MAC addresses.