Configuration
This guide is 99% out of date as of 2015 and will be removed and replaced shortly. Admins of hepcms: please consult our private Google pages for documentation.
The UMD HEP T3 cluster is composed of one head node (HN), one grid node (GN), one storage element node (SE), two interactive nodes (INs), and fifteen worker nodes (WNs). After RAID and formatting, we have ~9TB of disk space for interactive use, ~400GB for network-mounted software such as CMSSW, and ~400GB for users' network-mounted /home. With Hadoop, we have ~86TB of space for storage element (SE) hosted datasets. The cluster is managed by Rocks and is designed to have full T3 capability, including a storage element. It is on the Open Science Grid (OSG) and affiliated with the CMS virtual organization (VO).
Last edited September 10, 2015
Table of Contents
- Node Roles
- Hardware
- Some Partition Information
- Network
Node Roles
The OSG Site Planning guide played an important role in the design of our cluster. Our head node (HN) distributes the OS and basic configuration to all other nodes via Rocks kickstart files, and runs the Squid web proxy for accessing CMSSW's Frontier database. The grid node (GN) runs the OSG computing element (CE), storage element (SE), PhEDEx, and CMSSW. Users log in to and run interactive jobs on the two interactive nodes (INs), which have locally installed gLite-UI & CRAB software. The fifteen worker nodes (WNs) are members of the Condor pool and service batch jobs submitted either by local users or by grid users within our supported VOs (primarily CMS). The whole cluster fits in a single rack.
Head node:
external name: hepcms-hn.umd.edu
internal name: HEPCMS-0 (for historical reasons)
- Rocks head
- Condor pool manager
- Stores users' /home area, which is network mounted
- Ganglia monitor and web server
- Provides internal network gateway
- Squid web proxy for Frontier (CMSSW conditions database)
Grid node:
external name: hepcms-0.umd.edu
internal name: grid-0-0
- Job submission point to WNs for condor grid jobs
- Grid storage element (SE) & computing element (CE)
- Services SE requests with BeStMan-Gateway
- Hosts network-mounted OSG worker node client
- Controls big disk via DAS cable and PERC6/E controller
- Hosts network-mounted CMSSW
- Runs PhEDEx
Having one node fulfill the four important roles of CE, SE, PhEDEx service, and CMSSW network mount is not a scalable solution. We do this because splitting the roles is not practical on such a small cluster.
Some implementations of PhEDEx run atop gLite-UI, which may cause problems with the Rocks frontend, OSG CE, or SE. Additionally, some CRAB installations (such as ours) run atop gLite-UI, which may need to be configured differently for CRAB than for PhEDEx. Our PhEDEx installation uses simple srm commands instead of the specialized File Transfer Service (FTS), which requires gLite-UI. A PhEDEx installation that uses gLite-UI should not be on the OSG CE or SE, on a Rocks frontend, or on a node with gLite-UI configured for CRAB.
Storage Element Node:
internal name: SE-0-1
- Primary NameNode for Hadoop distributed disk storage
Two interactive nodes:
external names: hepcms.umd.edu points to hepcms-in1.umd.edu & hepcms-in2.umd.edu
internal names: interactive-0-0 & interactive-0-1
- Job submission point to WNs via Condor (interactive users)
- Installs gLite-UI & CRAB in /scratch
- Runs user interactive jobs
- Secondary NameNode for Hadoop
One important note: gLite-UI does not behave well on a Rocks frontend (some tarball installations of gLite-UI seem better behaved). So our CRAB, based on gLite-UI, cannot be installed on the HN, nor on the GN because of similar problems with the OSG CE & SE. However, CRAB does support job submission to European sites using Condor GlideIn to some CrabServers, which does not require gLite-UI.
Fifteen worker nodes:
Not externally accessible
internal names (some numbers missing): compute-0-1 -> compute-0-14, R510-0-1 -> R510-0-9
- Service CE (Condor) jobs sent via the GN
- Service interactive (Condor) jobs sent via any of the INs
- Stores CMSSW temporary output in /tmp
- Uses the network-mounted OSG WN client for binaries and configuration needed by grid jobs
- Part of the disk pool for Hadoop, hosted by the SE
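Batch work reaches the WNs through Condor, submitted either from the GN (grid jobs) or an IN (local users). A minimal Condor submit description file looks like the following sketch; the script name `myjob.sh` and file names are hypothetical:

```
universe                = vanilla
executable              = myjob.sh
output                  = job.$(Cluster).out
error                   = job.$(Cluster).err
log                     = job.$(Cluster).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

From an IN, `condor_submit myjob.sub` queues the job and `condor_q` shows its status in the pool.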
Hardware
HN: Dell PowerEdge 2950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 8GB 667MHz RAM
- PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (users' area and applications), ~420GB
- PERC6/E : currently unused
GN: Dell PowerEdge 2950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 8GB 667MHz RAM
- PERC6/I : controls physical disks 0 & 1 using RAID-1 (OS), ~70 GB; physical disks 2-5 using RAID-5 (CMSSW & OSG software network mounts), ~420GB
- PERC6/E : controls all 15 physical disks of PowerVault MD1000 (big disk), configured as RAID-6, ~9 TB
SE: Dell PowerEdge R410
- Two 6-core Xeon X5650 Processors 12MB Cache, 2.66GHz, 1333MHz FSB
- 24GB 1333MHz RAM
- 20GB /tmp disk
INs: Dell PowerEdge 1950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.66GHz, 1333MHz FSB
- 16GB 667MHz RAM
- 146GB primary disk
- 146GB /tmp disk
WNs
compute nodes: Dell PowerEdge 1950
- Two quad core Xeon E5440 Processors 2x6MB Cache, 2.83GHz, 1333MHz FSB
- 16GB 667MHz RAM
- 80GB primary disk
- 250GB /tmp disk
R510 compute nodes: Dell PowerEdge R510
- Two 6-core Xeon X5650 Processors 12MB Cache, 2.66GHz, 1333MHz FSB
- hyperthreaded to provide 24 logical cores per node
- 48GB 1333MHz RAM
- 146GB primary disk
- 177GB /tmp disk
PowerVault MD1000 (aka big disk)
- DAS
- 15 750GB 7.2K RPM SATA 3Gbps hard drives
- Controlled by the PERC6/E controller in the GN
PowerConnect 6248
- Managed switch
- Stacking capable
- 48 GbE ports
APC 2200 VA
- 120 Volt UPS
- Network controllable (currently not configured)
- Powers the two 2950s (HN, GN), the PowerVault, the two 1950 INs, one R510, and the switch
Two PowerEdge 2160AS KVM switches
- 16 KVM ports via CAT5 cables (requires Dell server interface pod)
- 2 physical KVM control ports, one connected to rack KVM
Some Partition Information
Head node:
- Squid is installed in /scratch/squid, which is not network mounted (Squid doesn't like network mounts or RAID-5). Squid is needed for contacting the Frontier CMS conditions database, which is a part of CMSSW.
- /export contains the users' network mounted /home area as well as the network mounted /share/apps area (Rocks default). /home/install is also used by Rocks as the OS & kickstart distribution point.
- We were not able to find explicit details on how /export is handled on subsequent Rocks upgrades, but we believe that /export is preserved between reinstalls.
- Because the /home/user and /share/apps sub-directories are auto-mounted over the network, they may not all be visible to an ls command; a directory is mounted only when it is first accessed (e.g., by cd'ing into it). ls /export/home from the HN will always show all users' home directories, and similarly for ls /export/apps.
Grid node:
- /scratch is network mounted as /sharesoft
- OSG is installed in /scratch/osg (this location will change when we switch to OSG RPM installations in the future)
- CMSSW is installed in /scratch/cmssw
- PhEDEx is installed in /localsoft/phedex
- /scratch and /localsoft are preserved across Rocks kickstarts, but will be formatted when the partition table changes
Interactive nodes:
- CRAB is installed in /scratch/crab
- gLite-UI is installed in /scratch/gLite
- /scratch is preserved across Rocks kickstarts, but will be formatted when the partition table changes
Worker nodes:
- /scratch is meant for locally installed WN software, currently unused
- /tmp is meant for temporary grid job output. It is used explicitly by CRAB CMSSW jobs and can be used as a temporary location for job output for interactive jobs or condor batch jobs. Output must be transferred by the user out of /tmp as soon as the job completes as this partition is regularly cleaned.
Big disk array:
The entire disk array is treated as a single drive by the OS. We use RAID-6 so that a single disk failure does not cause a significant performance loss, and so our data survives a dual disk failure. The disk is treated as a logical volume in the OS. Our disk array allows up to two additional arrays to be attached in a daisy-chain; by using LVM, we can install additional arrays and simply extend the logical volume over the new space. We use the XFS filesystem, which is designed to handle large disk volumes. At present the array is managed by the OS and network-mounted as /data on all nodes. This makes the array much more accessible to users, but it is not a scalable solution. After RAID-6 and formatting, the array is roughly 9TB in size.
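The ~9TB figure follows from simple arithmetic. A quick sketch, assuming the 15 x 750GB drives listed under Hardware and RAID-6's two-disk parity overhead (filesystem overhead ignored):

```python
# Rough usable-capacity arithmetic for the MD1000 array (a sketch; assumes
# the 15 x 750 GB drives listed under Hardware and RAID-6's two-disk
# parity overhead, ignoring filesystem and LVM metadata overhead).
DISKS = 15
DISK_GB = 750        # vendor gigabytes, 10**9 bytes each
PARITY = 2           # RAID-6 dedicates the equivalent of two disks to parity

usable_gb = (DISKS - PARITY) * DISK_GB        # capacity left for data
usable_tib = usable_gb * 10**9 / 2**40        # in binary TiB, as df -h reports
print(f"{usable_gb} GB usable, about {usable_tib:.1f} TiB before formatting")
```

This gives 9750 vendor GB, i.e. just under 9 TiB as reported by df, consistent with the "roughly 9TB" above.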
Hadoop:
The Hadoop data nodes are the worker nodes. Hadoop files are replicated across nodes, so the system is self-correcting if one node is down; if two nodes are down, there may be problems. Hadoop is intended primarily for read/write through the OSG Storage Element (SE). Currently the Hadoop volume is 86TB in size; because Hadoop uses replication, df -h shows 173TB. It is best to keep the Hadoop volume below 86% usage (74TB and below) in case of failure; this level allows one R510 node to be down while the system remains healthy.
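The capacity numbers above are related by the replication factor; a sketch, assuming 2x block replication and the ~173TB raw figure reported by df -h:

```python
# Hadoop capacity arithmetic (a sketch; assumes 2x block replication and
# the ~173 TB of raw datanode disk reported by df -h across the cluster).
RAW_TB = 173          # raw disk across all datanodes, as df -h reports
REPLICATION = 2       # each block is stored on two different nodes

usable_tb = RAW_TB / REPLICATION       # unique data the volume can hold
safe_tb = 0.86 * usable_tb             # the 86% threshold quoted above
print(f"usable: {usable_tb:.1f} TB, keep usage below {safe_tb:.0f} TB")
```

The threshold works out to roughly 74TB of unique data, matching the guideline above.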
Network
For security purposes, port information is not listed here. It can be read (by the root user only) in the file ~root/network-ports.txt on the HN.
external IP  : external hostname  : internal IP  : Rocks name
--------------------------------------------------------------
N/A          : N/A (switch)       : 10.1.255.254 : network-0-0
128.8.164.11 : hepcms-hn.umd.edu  : 10.1.1.1     : hepcms-hn
128.8.164.12 : hepcms-0.umd.edu   : 10.1.255.253 : grid-0-0
N/A          : N/A                : 10.1.255.238 : SE-0-1
N/A          : N/A                : 10.1.255.251 : compute-0-1
N/A          : N/A                : 10.1.255.248 : compute-0-4
N/A          : N/A                : 10.1.255.247 : compute-0-5
N/A          : N/A                : 10.1.255.246 : compute-0-6
N/A          : N/A                : 10.1.255.245 : compute-0-7
N/A          : N/A                : 10.1.255.249 : compute-0-9 -> DOWN
N/A          : N/A                : 10.1.255.236 : compute-0-11
N/A          : N/A                : 10.1.255.235 : compute-0-14
128.8.164.21 : hepcms-in1.umd.edu : 10.1.255.239 : interactive-0-0
128.8.164.22 : hepcms-in2.umd.edu : 10.1.255.237 : interactive-0-1
N/A          : N/A                : 10.1.255.243 : R510-0-1
N/A          : N/A                : 10.1.255.244 : R510-0-2
N/A          : N/A                : 10.1.255.241 : R510-0-4
N/A          : N/A                : 10.1.255.240 : R510-0-6
N/A          : N/A                : 10.1.255.250 : R510-0-8
N/A          : N/A                : 10.1.255.252 : R510-0-9
N/A          : N/A                : 10.1.255.228 : R510-0-17
- Internal network is always on eth0
- External network is always on eth1
- Except on the R510 nodes, which are channel bonded to share eth0 and eth1 on the internal network
External Gateway: 128.8.164.1
Netmask for external internet: 255.255.255.0
Netmask for internal network (on HN): 255.0.0.0
DNS for external internet: 128.8.74.2, 128.8.76.2
DNS for internal network (on HN): 10.1.1.1
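On a RHEL/CentOS-style system such as a Rocks node, channel bonding of the kind used on the R510s is typically configured with ifcfg files like the following. This is a generic sketch, not our actual configuration; the IP address and bonding options shown are assumptions:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch; IP and options assumed)
DEVICE=bond0
IPADDR=10.1.255.243
NETMASK=255.0.0.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=balance-alb miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Both physical interfaces are enslaved to bond0, which carries the node's single internal-network IP.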
The command 'dbreport dhcpd' issued from the HN can provide much of this information, including MAC addresses.