How To: Guides for users and Maryland T3 admins.

Help: Links and emails for further info.

Configuration: Technical layout of the cluster, primarily for admins.

Log: Has been moved to a Google page, accessible only to admins.

Configuration

This guide is almost entirely out of date as of 2015 and will be removed and replaced shortly. ADMINS of hepcms: please consult our private Google pages for current documentation.

The UMD HEP T3 cluster is composed of one head node (HN), one grid node (GN), one storage element node, two interactive nodes (INs), and fifteen worker nodes (WNs). After RAID and formatting, we have ~9TB of disk space for interactive use, ~400GB for network-mounted software such as CMSSW, and ~400GB for users' network-mounted /home. With Hadoop, we have ~86TB of space for datasets hosted on the storage element (SE). The cluster is managed by Rocks and is designed to have full T3 capability, including a storage element. It is on the Open Science Grid (OSG) and affiliated with the CMS virtual organization (VO).
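These network mounts are visible from any node; for example, a user on an interactive node can check the space actually available with df (a minimal check, using the /home and /data mount points named later in this guide):

    # Sizes and usage of the network-mounted user areas
    df -h /home /data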

Last edited September 10, 2015

Node Roles

The OSG Site Planning guide played an important role in the design of our cluster. Our head node (HN) distributes the OS and basic configuration to all other nodes via Rocks Kickstart files, and also runs the Squid web proxy used to access CMSSW's Frontier database. The grid node (GN) runs the OSG computing element (CE), storage element (SE), PhEDEx, and CMSSW. Users log in to and run interactive jobs on the two interactive nodes (INs), which have locally installed gLite-UI and CRAB software. The fifteen worker nodes (WNs) are members of the Condor pool and service batch jobs submitted either by local users or by grid users within our supported VOs (primarily CMS). The whole cluster is contained in a single rack.
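For example, a user logged in to an interactive node can confirm that the worker nodes are in the Condor pool and inspect the batch queue with the standard Condor tools (a minimal sketch; the output depends on the pool's current state):

    # Slot-by-slot view of the pool (worker nodes compute-* and R510-*)
    condor_status

    # Summary totals only
    condor_status -total

    # Jobs in the local queue on this submit node
    condor_q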

Head node:

external name: hepcms-hn.umd.edu
internal name: HEPCMS-0 (for historical reasons)

Grid node:

external name: hepcms-0.umd.edu
internal name: grid-0-0

Having one node fulfill the four important roles of CE, SE, PhEDEx service, and CMSSW network mount is not a scalable solution. We do this because splitting the roles is not practical on such a small cluster.

Some implementations of PhEDEx run atop gLite-UI, which may cause problems with the Rocks frontend, the OSG CE, or the SE. Additionally, some CRAB installations (such as ours) run atop gLite-UI, which may need to be configured differently for CRAB than for PhEDEx. Our PhEDEx installation uses simple srm commands instead of the specialized File Transfer Service (FTS), which requires gLite-UI. A PhEDEx installation that uses gLite-UI should not be placed on the OSG CE or SE, on a Rocks frontend, or on a node with gLite-UI configured for CRAB.
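For illustration, a single SRM transfer of the kind our PhEDEx setup relies on might look like the following. The port 8443 and the service path srm/v2/server are common BeStMan defaults and are assumptions here, as is the destination path; they are not taken from our actual configuration:

    # A valid grid proxy is needed first
    voms-proxy-init -voms cms

    # Copy a local test file to the SE via SRM v2 (port and paths are placeholders)
    srmcp -2 file:////tmp/testfile \
        "srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/hadoop/store/user/USERNAME/testfile"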

Storage Element Node:

internal name: SE-0-1

Two interactive nodes:

external names: hepcms.umd.edu points to hepcms-in1.umd.edu & hepcms-in2.umd.edu
internal names: interactive-0-0 & interactive-0-1

One note of import is that gLite-UI does not do well on a Rocks frontend (some tarball installations of gLite-UI seem better behaved), so our CRAB, which is based on gLite-UI, cannot be installed on the HN, nor on the GN because of similar conflicts with the OSG CE and SE. However, CRAB does support job submission to European sites via Condor GlideIn through some CrabServers, which does not require gLite-UI.
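As a sketch of what GlideIn-based submission looks like in CRAB2-era syntax (the dataset, pset, and scheduler value below are illustrative assumptions, not our recommended configuration), the crab.cfg selects a scheduler and jobs are then created and submitted from an interactive node:

    # crab.cfg fragment (illustrative values only)
    [CRAB]
    jobtype   = cmssw
    scheduler = glidein        # GlideIn submission; does not require gLite-UI

    [CMSSW]
    datasetpath = /SomeDataset/SomeProcessing/AOD    # placeholder
    pset        = my_analysis_cfg.py                 # placeholder

    [USER]
    return_data = 1

Then, from an interactive node:

    crab -create
    crab -submit
    crab -status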

Fifteen worker nodes:

Not externally accessible
internal names (some numbers missing): compute-0-1 -> compute-0-14 and R510-0-1 -> R510-0-17

 


 

Hardware

HN: Dell PowerEdge 2950

GN: Dell PowerEdge 2950

SE: Dell PowerEdge R410

INs: Dell PowerEdge 1950

WNs

compute nodes: Dell PowerEdge 1950

R510 compute nodes: Dell PowerEdge R510

PowerVault MD1000 (aka big disk)

PowerConnect 6248

APC 2200 VA UPS

Two PowerEdge 2160AS KVM switches

 


 

Some Partition Information

Head node:

Grid node:

Interactive nodes:

Worker nodes:

Big disk array:

The entire disk array is treated as a single drive by the OS and managed as an LVM logical volume. We use RAID-6, so a single disk failure does not cause a significant performance loss and the data survives a dual disk failure. The array allows up to two additional arrays to be daisy-chained to it; because we use LVM, we can install additional arrays and simply extend the logical volume over the new space. The volume is formatted with XFS, which is designed to handle large disk volumes. At present the array is managed by the OS and network mounted as /data on all nodes; this makes it very accessible to users, but it is not a scalable solution. After RAID-6 and formatting, the array provides roughly 9TB of space.
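A sketch of what that extension would involve, assuming the new array shows up as /dev/sdc and the existing volume group and logical volume are named vg_data and lv_data (the actual device and LVM names on the cluster may differ):

    # Register the new array as an LVM physical volume
    pvcreate /dev/sdc

    # Grow the existing volume group onto the new physical volume
    vgextend vg_data /dev/sdc

    # Extend the logical volume over all newly available space
    lvextend -l +100%FREE /dev/vg_data/lv_data

    # Grow the XFS filesystem online (XFS can grow but not shrink)
    xfs_growfs /data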

Hadoop:

The Hadoop data nodes are the worker nodes. Hadoop files are replicated across nodes, so the system is self-correcting if one node is down; if two nodes are down, there may be problems. Hadoop is intended primarily for read/write access through the OSG storage element (SE). Currently the Hadoop volume is 86TB in size; because of the replication, df -h reports 173TB. It is best to keep Hadoop usage below 86% (about 74TB) in case of failure; this level allows one R510 node to be down while the system stays healthy.
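A quick way to check how close the volume is to that threshold (the /hadoop FUSE mount point is the conventional OSG location and an assumption here; the dfsadmin report must be run where the Hadoop client is configured):

    # Capacity, usage, and per-datanode status as seen by HDFS
    hadoop dfsadmin -report

    # Usage of the FUSE mount; reports the raw, replicated size (~173TB)
    df -h /hadoop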

 

 


 

Network

For security purposes, port information is not listed here. It can be read (by the root user only) in the file ~root/network-ports.txt on the HN.

external IP  : external hostname  : internal IP    : Rocks name 
--------------------------------------------------------------------
    N/A      :   N/A (switch)     : 10.1.255.254 : network-0-0
128.8.164.11 : hepcms-hn.umd.edu  : 10.1.1.1     : hepcms-hn 
128.8.164.12 : hepcms-0.umd.edu   : 10.1.255.253 : grid-0-0 
    N/A      :       N/A          : 10.1.255.238 : SE-0-1
    N/A      :       N/A          : 10.1.255.251 : compute-0-1 
    N/A      :       N/A          : 10.1.255.248 : compute-0-4 
    N/A      :       N/A          : 10.1.255.247 : compute-0-5 
    N/A      :       N/A          : 10.1.255.246 : compute-0-6 
    N/A      :       N/A          : 10.1.255.245 : compute-0-7 
    N/A      :       N/A          : 10.1.255.249 : compute-0-9 --> DOWN
    N/A      :       N/A          : 10.1.255.236 : compute-0-11
    N/A      :       N/A          : 10.1.255.235 : compute-0-14     
128.8.164.21 : hepcms-in1.umd.edu : 10.1.255.239 : interactive-0-0 
128.8.164.22 : hepcms-in2.umd.edu : 10.1.255.237 : interactive-0-1
    N/A      :       N/A          : 10.1.255.243 : R510-0-1 
    N/A      :       N/A          : 10.1.255.244 : R510-0-2 
    N/A      :       N/A          : 10.1.255.241 : R510-0-4      
    N/A      :       N/A          : 10.1.255.240 : R510-0-6 
    N/A      :       N/A          : 10.1.255.250 : R510-0-8
    N/A      :       N/A          : 10.1.255.252 : R510-0-9 
    N/A      :       N/A          : 10.1.255.228 : R510-0-17
    

Internal network: always on eth0
External network: always on eth1
Exception: the R510 nodes, whose eth0 and eth1 are channel bonded together on the internal network

External gateway: 128.8.164.1
Netmask for external network: 255.255.255.0
Netmask for internal network (on HN): 255.0.0.0
DNS for external network: 128.8.74.2, 128.8.76.2
DNS for internal network (on HN): 10.1.1.1
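
To verify this layout on a particular node (bond0 is the usual Linux name for a channel-bonded interface and is an assumption here):

    # Addresses on the internal (eth0) and external (eth1) interfaces
    ip addr show eth0
    ip addr show eth1

    # On an R510 node, inspect the bonded interface and its slave NICs
    cat /proc/net/bonding/bond0

    # Confirm the default gateway and DNS servers actually in use
    ip route show
    cat /etc/resolv.conf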

The command 'dbreport dhcpd' issued from the HN can provide much of this information, including MAC addresses.
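
For example, the following can be run as root on the HN (dbreport dhcpd is the command named above; rocks list host interface is its equivalent in Rocks 5 and later, if available on this release):

    # DHCP configuration generated from the Rocks database (includes MACs)
    dbreport dhcpd

    # Per-host interface details (MAC, IP, subnet) from the Rocks database
    rocks list host interface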