How To: Guides for users and Maryland T3 admins.

Help: Links and emails for further info.

Configuration: Technical layout of the cluster, primarily for admins.

Log: All changes done to the cluster and errors encountered, primarily for admins.

Admin How-To Guide

This guide is deprecated. As of Dec. 31, 2009, a newer guide for SL5 is available. This page will be moved to our archives soon, so please update your links.

This guide is meant primarily for UMD admins and serves as a documented single use case of a USCMS Tier 3 site. It is based on our cluster configuration and hardware, documented here.

We do not call out where you might need to change your syntax, so if you are a non-UMD admin, we recommend you follow the guides linked at the beginning of each set of instructions and reference this guide to see what choices we made. We have not necessarily followed "best setup" practices, and you use this guide to set up or modify your cluster at your own risk.

Dependencies are listed at the beginning of every set of instructions. Instructions are ideally (but not in practice) updated every time we install a new software release. If any links have expired, any errors are found, or some points are unclear, please notify the System administrators.

In mid-August 2009, we performed numerous software updates and changed our hardware configuration. We have kept an archived copy of the old admin guide as well as the old hardware configuration.

Last edited November 8, 2012

Table of Contents

 


 

Connect to the switch

This is a guide intended for basic setup in a Rocks cluster. The Dell 6224 (a rebranded Cisco) is a fully managed switch, meant for use in a larger switching fabric, so it has many powerful features (most of which will not be covered here). The specific configuration details may vary, depending on your local environment. If in doubt, please consult your local network administrator. We first connect to the switch via a direct serial connection to get it to issue DHCP requests. We then get Rocks to listen to the DHCP request and assign an IP address, then do final configuration via a web browser.

In addition to the information we provide, all of the Dell 6224 manuals can be downloaded here.

Direct serial connection

1. The VT100 emulator:

First, connect the switch and headnode (or computer of choice) using the manufacturer-supplied serial cable. A terminal program, such as 'minicom' (available in most Linux distros), can be used to talk to the switch. It must be noted here that we were unable to get our headnode to communicate with the switch over the serial console using minicom, so a laptop with a serial port running Linux was used instead (this is a local anomaly and should not be considered the default).

Alternative terminal programs for serial console:

2. Settings for serial console:

The most common configuration for asynchronous mode is used: 8-N-1.

8 = 8 data bits
N = no parity bits
1 = 1 stop bit

Most console programs will default to these settings. Additionally, the communication speed should be set to at least 9600 baud.
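If you use GNU screen instead of minicom, a minimal invocation is shown below (a sketch; /dev/ttyS0 is an assumed device name, check dmesg for yours):

screen /dev/ttyS0 9600

screen uses 8-N-1 by default at the given speed; press Ctrl-a followed by k to end the session.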

3. Initial setup:

Power on the switch and wait for startup to complete. The Easy Setup Wizard will display on an unconfigured switch. These are the important points:

  1. Would you like to set up the SNMP management interface now? [Y/N]  N
    Choose no (unless you have centralized Dell OpenManage or other management).
  2. To set up a user account: The default account is 'admin', but anything may be used.
  3. To set up an IP address: Choose 'DHCP', as Rocks will handle address assignments in the cluster.
  4. Select 'Y' to save the config and restart.

We also experimented with dividing certain types of traffic into separate VLANs. It was deemed unnecessary, given the present size of our cluster, but may be revisited should we add considerably more nodes, or if network traffic control proves problematic.

4. Network connections:

Now get Rocks to recognize the DHCP request issued by the switch by proceeding with step 9 of the Rocks installation instructions. In short, after Rocks has been installed on the HN:

insert-ethers
Select 'Ethernet switches'
Wait at least 30 mins after powering the switch for it to issue the DHCP request

After Rocks assigns an IP to the switch, it can be configured over telnet, SSH, and HTTP, from the headnode. The default name for the switch is network-0-0.

Using a graphical browser:

As outlined in step 9 of the Rocks installation instructions, the Spanning Tree Protocol (STP) must be disabled. It is often recommended to configure STP, which we did initially, but we could not get the worker nodes to pull an address from DHCP. After some experimentation, all ports on the switch were set to 'portfast' mode, which solved the problem; however, this is essentially the same as turning STP off completely, which also works just fine. The problem is that links go up and down a few times during the DHCP request, and STP won't activate a port until it has been up for several seconds, so Rocks would never see the end nodes. Disabling STP can be done from the command line, but it is simpler to use the web-enabled interface from a browser on the headnode (or over X forwarding from the command line).

From the head node, open a graphical browser and enter the IP address: 10.255.255.254. The user name and password can be obtained from the System administrators. This is a semi-dynamically allocated IP, so in rare cases, the IP may be re-assigned. If this IP does not connect you to the switch, issue the command 'dbreport dhcpd' and look for the network-0-0.local bracket, where the local IP address will be listed. If the network-0-0.local bracket does not exist, a portion of the Rocks install must be redone (see "Install Rocks" below, instruction 9). Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
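As a quick way to pull that host block out of the report (a sketch; the '-A 6' context length is a guess and may need adjusting):

dbreport dhcpd | grep -A 6 network-0-0

The switch's IP address appears on the fixed-address line of that host block, in standard dhcpd.conf syntax.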

If, for some reason, the browser method doesn't work, type these commands at the VT100 console provided by minicom or similar software:

console#config
console(config)#spanning-tree disable <port number>
(this will have to be done for all 24 ports!)
console(config)#exit
console#show spanning-tree
Spanning tree Disabled  mode rstp
console#quit

 


 

Install Rocks

These instructions are for installing Rocks 4.3 using Scientific Linux 4.5, x86_64 architecture, adding the condor roll. Rocks downloads are available here, SL4.5 is available here. The Rocks 4.3 user's guide is available here.

  1. Download the Rocks Kernel/Boot roll
  2. Download the Rocks Core roll (includes the required rolls of base, hpc, and web-server, with a few nice extras)
  3. Download the Rocks Condor roll
  4. Download Scientific Linux 4.5 (all disks)
  5. Download a special SL4.5 patch for Rocks, labeled the comps roll.
  6. Burn all the .iso files to disks (e.g., with BurnCDCC, a Windows .iso burner).
  7. Follow the Rocks 4.3 user's guide to install Rocks on the head node. Additions to the guide:
    1. Our network configuration is detailed here. The initial boot phase is on a timer and will terminate if you do not enter the network information quickly enough.
    2. We selected the base, ganglia, hpc, java, and web-server rolls from the Core CD. We believe the grid roll may actually be counter-productive, as it attempts to set up the cluster as a certificate authority, which may interfere with the OSG configuration.
    3. Be sure to add the kernel, comps and condor rolls.
    4. Insert each SL4.5 disk in turn and select the LTS roll listed.
    5. As far as we know, the questions about certificate information on the "Cluster Information" screen are not used by any applications that we install. We entered the following, which may or may not be correct:

      FQHN: HEPCMS-0.UMD.EDU (originally, now hepcms-hn.umd.edu)
      Name: UMD HEP CMS T3
      Certificate Organization: DOEgrids
      Certificate Locality: College Park
      Certificate State: Maryland
      Certificate Country: US
      Contact: mtonjes@nospam.umd.edu (w/o the nospam)
      URL: http://hep-t3.physics.umd.edu
      Latitude/Longitude: 38.98N -76.92W

    6. Select manual partitioning and allocate the following partition table (if you wish to preserve existing data, be sure to restore the partition table and don't modify any partitions you wish to keep):

      /dev/sda (sizes in MB):
      /          8189 /sda1 ext3
      swap       8189 /sda2 swap
      /var       4095 /sda3 ext3
      /sda4 is the extended partition which includes /sda5
      /scratch  48901 /sda5 ext3 (fill to max available size)

      /dev/sdb 418168 MB (RAID-5, 408.38 GB, physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5):
      /export  418168  /sdb1 ext3 (fill to max available size)

      Leave /dev/sdc (the big disk array) alone as it is a logical volume and Rocks cannot handle logical volumes at the install stage.

    7. In some cases, Rocks does not properly eject the boot disk before restarting. Be sure to eject the disk after Rocks is done installing, but before the reboot sequence completes and goes to the CD boot.
    8. On your first login to the HN, you will be prompted to generate rsa keys. You should do so (the default file is fine, as well as using the same password).
  8. Read the Rocks 4.3 user's guide on how to change the partition tables on the worker nodes. Note that the code below may not work if you have existing partitions on any of the WNs, since Rocks tries to preserve existing partitions when it can. If the code below does not work (symptoms include pop-ups during install complaining about FDiskWindow, and kernel panics after install from incorrectly synced configs), try forcing the default partitioning scheme and modifying the Rocks WN partitions after install. In this case, you will probably lose any existing data on the WNs and should use the Rocks boot disk rather than PXE boot. Additionally, our setup somehow causes LABEL synchronization issues on subsequent calls to shoot-node; we must add some commands to extend-compute.xml to fix this issue. The necessary commands to set the WN partitions prior to the first WN Rocks installation:
    1. cd /home/install/site-profiles/4.3/nodes/
    2. cp skeleton.xml replace-auto-partition.xml
    3. Edit the <main> section of replace-auto-partition.xml:
      <main>
      <part> / --size 8192 --ondisk sda </part>
      <part> swap --size 8192 --ondisk sda </part>
      <part> /var --size 4096 --ondisk sda </part>
      <part> /scratch --size 1 --grow --ondisk sda </part>
      <part> /tmp --size 1 --grow --ondisk sdb </part>
      </main>
    4. cp skeleton.xml extend-compute.xml
    5. Edit the <post> section of extend-compute.xml and add:
      e2label /dev/sda1 /
      cat /etc/fstab | sed -e s_LABEL=/1_LABEL=/_ &gt; /tmp/fstab
      cp -f /tmp/fstab /etc/fstab
      cat /boot/grub/grub-orig.conf | sed -e s_LABEL=/1_LABEL=/_ &gt; /tmp/grub.conf
      cp -f /tmp/grub.conf /boot/grub/grub-orig.conf
      chmod +w /boot/grub/grub-orig.conf
      unlink /boot/grub/grub.conf
      ln -s /boot/grub/grub-orig.conf /boot/grub/grub.conf
    6. cd /home/install
    7. rocks-dist dist
  9. Follow the Rocks 4.3 user's guide to set up the worker nodes. Additions to the guide:
    1. If you have not already done so, be sure to configure the switch via the serial cable to get its IP via DHCP and set a login name and password (for internet management).
    2. We do have a managed switch, so the first task, done by selecting 'Ethernet switches' in the insert-ethers menu, should be performed. The switch takes a long time to issue DHCP requests after powering up; wait at least 30 mins.
    3. Quit insert-ethers using the F11 key, not F10.
    4. Once insert-ethers has detected the switch, open an internet browser and log into the switch (typically 10.255.255.254, but dbreport dhcpd lists the switch's local IP inside the network-X-Y bracket). The user name and password can be provided to you by the System administrators.
    5. Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
    6. Continue with the remainder of the Rocks WN instructions.
    7. PXE boot can be initiated on all the WNs by striking the F12 key at the time of boot. Alternatively, insert the Rocks Kernel/Boot CD into each WN shortly after pressing the power button.
  10. Security needs to be configured (quickly). Instructions to do so are located in the file ~root/security.txt, readable only by root. If this file was lost during the Rocks install, contact the System administrators for the backup. If you are another site following these instructions, you can contact the Sysadmins for a copy to use as a starting point for your own site security configuration (security tends to be site specific and we don't claim our security is fool-proof). Your identity will need to be confirmed by the Sysadmins.
  11. Be sure to update all the RPMs on all nodes after they are installed.

 


 

Modify Rocks

Modify cluster database

The information stored in the Rocks cluster database can be viewed and edited here; the user name and password can be obtained from Marguerite Tonjes. The MySQL DB can be restarted by issuing the command /etc/init.d/mysqld restart from the HN as root (su -).

Prevent automatic re-install

Rocks will automatically re-install all nodes except the HN after they have experienced a hard reboot (such as power failure). This is a useful feature during installation stages, but can be a performance issue once the cluster is in a stable configuration. Simply follow the instructions in this Rocks FAQ to disable this feature. Be sure to re-install the nodes (not the HN) to get the changes to propagate. After removing this feature, shoot-node and cluster-kickstart commands will issue the error:
cannot remove lock: unlink failed: No such file or directory
error reading information on service rocks-grub: No such file or directory
cannot reboot: /sbin/chkconfig failed: Illegal seek

which can be safely ignored.

Non-HN re-installation

Some modifications will require the nodes in the Rocks network to be reinstalled. This tends to be true in cases which require you to issue the command 'rocks-dist dist,' typically because you edited an .xml file. In most cases, this involves simply issuing:

ssh-agent $SHELL
ssh-add
shoot-node compute-0-0
(repeat for all desired nodes)

An alternative method of re-shooting the nodes is shown below. It is not clear which approach is superior.

ssh-agent $SHELL
ssh-add
ssh compute-0-0 'sh /home/install/sbin/nukeit.sh'

(repeat for all desired nodes or use cluster-fork)
ssh compute-0-0 '/boot/kickstart/cluster-kickstart'
(repeat for all desired nodes or use cluster-fork)

If you have not yet made nukeit.sh, see the instructions to modify WN partitions.

OMSA cannot be fully installed as a part of the Rocks Kickstart. Be sure to follow the instructions for OMSA non-HN installation in step 3 after every reinstall.

Since Rocks requires a reinstall of nodes every time a change is made to their kickstart files and we have interactive nodes, you may want to wait until a scheduled maintenance time to reinstall. The cluster-fork command is useful to get the desired functionality prior to reinstall:

ssh-agent $SHELL
ssh-add
cluster-fork "command"

"command" can be anything you'd like run on each WN individually, which could include a network-mounted shell script.

After every major reinstall, in addition to testing whatever changes were made, we like to test a few basic capabilities to make sure nothing was broken. A general outline of the tests we perform:

Modify non-HN partitions

These instructions are based on this Rocks guide. You will lose any existing data on the node. Additionally, our setup somehow causes LABEL synchronization issues on subsequent calls to shoot-node; we must add some commands to extend-compute.xml to fix this issue.

As root (su -) on the HN:

cd /home/install/site-profiles/4.3/nodes/
cp skeleton.xml replace-auto-partition.xml

If extend-compute.xml does not yet exist:
cp skeleton.xml extend-compute.xml

Edit the <main> section of replace-auto-partition.xml:

<main>
<part> / --size 8192 --ondisk sda </part>
<part> swap --size 8192 --ondisk sda </part>
<part> /var --size 4096 --ondisk sda </part>
<part> /scratch --size 1 --grow --ondisk sda </part>
<part> /tmp --size 1 --grow --ondisk sdb </part>
</main>

Edit the <post> section of extend-compute.xml and add:

e2label /dev/sda1 /
cat /etc/fstab | sed -e s_LABEL=/1_LABEL=/_ &gt; /tmp/fstab
cp -f /tmp/fstab /etc/fstab
cat /boot/grub/grub-orig.conf | sed -e s_LABEL=/1_LABEL=/_ &gt; /tmp/grub.conf
cp -f /tmp/grub.conf /boot/grub/grub-orig.conf
chmod +w /boot/grub/grub-orig.conf
unlink /boot/grub/grub.conf
ln -s /boot/grub/grub-orig.conf /boot/grub/grub.conf

cd /home/install
rocks-dist dist
rocks remove host partition compute-0-0

(repeat through compute-0-7)

Create the /home/install/sbin/nukeit.sh script:

#!/bin/sh
# Remove the .rocks-release marker from every mounted filesystem so that the
# next Rocks reinstall applies the new partition table instead of preserving
# the existing partitions.
for i in `df | awk '{print $6}'`
do
  if [ -f $i/.rocks-release ]
  then
    rm -f $i/.rocks-release
  fi
done

ssh compute-0-0 'sh /home/install/sbin/nukeit.sh'
(repeat for all desired nodes or use cluster-fork)
ssh compute-0-0 '/boot/kickstart/cluster-kickstart'
(repeat for all desired nodes or use cluster-fork)

In some cases the partitions aren't done properly; it is unclear why. Kernel panics when the node attempts to boot are an indicator of this issue (the node will never reconnect; you must physically go to the node to ascertain this condition). In such a case, it is best to force the default partitioning scheme on these nodes, install, then try again with the preferred partitioning scheme. Use the Rocks boot disk, as PXE boot does not seem sufficient. To force the default partitioning scheme, simply replace all the <part> lines in replace-auto-partition.xml with:
<part> force-default </part>
You will lose all data on the nodes for which you force the default scheme.

Configure external network for all other nodes:

Add new users

First set the default shell for all new users to tcsh. Edit /etc/default/useradd and change the SHELL line to:
SHELL=/bin/tcsh
This is optional, but commands in this guide assume that root uses a bash shell and all other users use a c-shell.
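The edit can also be made non-interactively with GNU sed and then verified; a sketch:

sed -i 's|^SHELL=.*|SHELL=/bin/tcsh|' /etc/default/useradd
useradd -D | grep SHELL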

useradd -c "Full Name" -n username
passwd username (select an initial password)
chage -d 0 username
ssh-agent $SHELL
ssh-add (enter the root password)
rocks sync config
rocks sync users

If the big disk array has already been mounted, give the user their own directory:
mkdir /data/users/username
chown username:users /data/users/username

Some notes:

Modify users

As root (su -), first utilize standard Linux commands to modify the user (system-config-users provides a GUI if desired). Then update Rocks:

ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users

Note that to delete a user's home area, you must remove it from /export/home manually. You must also remove the relevant lines in /etc/auto.home:

chmod 744 /etc/auto.home
remove the line with the user's name
chmod 444 /etc/auto.home
make -C /var/411
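If you prefer a one-liner for the "remove the line" step, GNU sed can delete the user's entry in place (substitute the actual account name for username, and run it between the two chmod commands above):

sed -i '/^username[[:space:]]/d' /etc/auto.home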

You should also remove their space in /data. On the GN as root (su -):
rm -rf /data/users/username

Add rolls

Download the appropriate .iso file from Rocks. We'll call it rollFile.iso which corresponds to rollName.

As root (su -):

mount -o loop rollFile.iso /mnt/cdrom
cd /home/install
rocks-dist --install copyroll
umount /mnt/cdrom
rocks-dist dist
kroll rollName | bash
init 6

You can check that the roll installed successfully:
dbreport kickstart HEPCMS-0 > /tmp/ks.cfg
Look in /tmp/ks.cfg for something like:
# ./nodes/somefiles.xml (rollName)

While documentation on this is poor, it seems wisest to re-install the WNs to ensure the changes are propagated.

Create appliances

We create Rocks appliances for our interactive & grid nodes. Neither the interactive nor the grid appliance services condor jobs, though both can submit them, so a special section is dedicated in their Kickstart xml files to configuring condor correctly for this case.

These commands are executed as root (su -) on the Rocks head node.

To create the grid appliance:

Our grid node Kickstart file is bare bones because OSG cannot be preserved via tarball for later reinstalls. The grid appliance is not intended for subsequent reinstall, so be sure to configure its partition table, external network interface, and Condor, then reinstall before installing any other software on the grid node. Since the grid node has a different partition table than the compute nodes, we modify its partition table inside grid.xml below.

  1. Place the files grid.xml in /home/install/site-profiles/4.3/nodes and grid-appliance.xml in /home/install/site-profiles/4.3/graphs/default.
  2. Create the new Rocks distribution:
    cd /home/install
    rocks-dist dist
  3. Add an entry for the new grid appliance to the Rocks MySQL database:
    rocks add appliance grid membership='Grid Management Node' short-name='gr' node='grid'
  4. Verify that the new XML code is correct:
    rocks list appliance xml grid
    If this throws an exception, the last line states where the syntax problem is.
  5. Now install the grid node by calling insert-ethers, selecting Grid Management Node, powering up the new node and selecting PXE boot on the new node as it boots.

To create the interactive appliance:

Our interactive node kickstart file is similar to the grid node, except interactive nodes also install gLite-UI & CRAB via Kickstart. Therefore, interactive nodes can be successfully reinstalled via Rocks Kickstart without loss of software.

  1. Navigate to the gLite-UI tarball repository and select your desired version of gLite-UI. These instructions are for 3.1.28-0, though they can be adapted for other releases. Download the lcg-CA yum repo and tarballs where they can be served from the HN:
    cd /home/install/contrib/4.3/x86_64/RPMS
    wget "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo"
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0.tar.gz"
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0-external.tar.gz"
  2. Navigate to the CRAB download page and select your desired version of CRAB. These instructions are for 2_6_1, though they can be adapted for other releases. Download the tarball where it can be served from the HN:
    cd /home/install/contrib/4.3/x86_64/RPMS
    wget --no-check-certificate "http://cmsdoc.cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_6_1.tgz"
  3. Place the files interactive.xml in /home/install/site-profiles/4.3/nodes and interactive-appliance.xml in /home/install/site-profiles/4.3/graphs/default.
  4. Create the new Rocks distribution:
    cd /home/install
    rocks-dist dist
  5. Add an entry for the new interactive appliance to the Rocks MySQL database:
    rocks add appliance interactive membership='Interactive Node' short-name='in' node='interactive'
  6. Verify that the new XML code is correct:
    rocks list appliance xml interactive
    If this throws an exception, the last line states where the syntax problem is.
  7. Now install the interactive node by calling insert-ethers, selecting Interactive Node, powering up the new node and selecting PXE boot on the new node as it boots.

Update RPMs

We choose to call yum -y update on all nodes to update RPMs instead of serving the updated RPMs from the HN during Kickstart. This increases the time to a fully operational node after reinstall, but saves on human time to track down the many RPMs. Be sure to call yum update on a regular basis on all nodes - you may want to consider creating a cron job to do it, though you'll need to check if a reboot is needed.
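If you do create such a cron job, a minimal sketch of the entry (added to /var/spool/cron/root on each node, like the srm-drop cleanup job elsewhere in this guide) is:

30 03 * * * yum -y update > /tmp/yum-update.log 2>&1

This applies updates nightly at 3:30am but deliberately does not reboot; the kernel/selinux-policy/glibc check in step 1 below still has to be done by hand.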

  1. Find out if this update will require a reboot:
    yum check-update | grep -i kernel
    yum check-update | grep -i selinux-policy
    yum check-update | grep -i glibc

    A new kernel always requires a reboot and the other two are safest with a reboot. If the grid node will have a kernel update, the xfs rpm appropriate to the new kernel needs to be installed:
    rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/kernel-module-xfs-#.#.#-##.ELsmp-0.4-1.x86_64.rpm"
  2. The tomcat-connectors rpm conflicts with files in the httpd rpm. Check if an update to httpd will occur:
    yum check-update | grep -i httpd
    If so, remove tomcat-connectors (we'll reinstall it when we're done):
    yum -y remove tomcat-connectors
  3. Now update (this process could take quite some time):
    yum -y update
  4. If tomcat-connectors was removed, install it again:
    rpm -ivh /home/install/rocks-dist/lan/x86_64/RedHat/RPMS/tomcat-connectors-1.2.20-0.x86_64.rpm --force
  5. If appropriate, reboot the node, check that the new kernel is running, and start OMSA if it's installed:
    reboot
    uname -r
    srvadmin-services.sh start

 


 

Upgrade RAID firmware & drivers

Updating firmware will require shutdown of various services as well as reboot of the HN. Be sure to schedule all firmware and driver updates in advance. The instructions below provide details for handling the big disk array (/data), but do not require that it be configured properly before upgrade; indeed, it is recommended that the RAID firmware and drivers be upgraded before mounting the big disk.

  1. Go to www.dell.com
  2. Under support, enter the HN Dell service tag
  3. Select drivers & downloads
  4. Choose RHEL4.5 for the OS
  5. Select SAS RAID Controller for the category
  6. Select the drivers and firmware for PERC 6/E Adaptor and PERC 6/i Integrated for download.
  7. Follow the PhEDEx instructions to stop all PhEDEx services.
  8. Stop OSG services:
    /etc/rc3.d/S97bestman stop
    cd /sharesoft/osg/ce
    . setup.sh
    vdt-control --off
  9. Stop what file services we can:
    omconfig system webserver action=stop
    /etc/init.d/dataeng stop
    cd /sharesoft/osg/ce
    . setup.sh
    vdt-control --off
    /etc/rc.d/init.d/nfs stop
    umount /data
    cluster-fork "umount /data"
  10. As root (su -) on the HN, install the firmware:
    1. The firmware download link should go to an executable, which is not the right file to install in Linux. From the executable name and location and by browsing the ftp server, you can extrapolate the location of the READMEs, e.g.:
      wget "ftp://ftp.us.dell.com/SAS-RAID/R216021.txt"
      wget "ftp://ftp.us.dell.com/SAS-RAID/R216024.txt"
    2. By reading the READMEs, you can extrapolate the location of the correct binaries, e.g.:
      wget "ftp://ftp.us.dell.com/SAS-RAID/RAID_FRMW_LX_R216021.BIN"
      wget "ftp://ftp.us.dell.com/SAS-RAID/RAID_FRMW_LX_R216024.BIN"
    3. Make the binaries executable:
      chmod +x RAID_FRMW_LX_R216021.BIN
      chmod +x RAID_FRMW_LX_R216024.BIN
    4. Follow the instructions in the READMEs.
    5. Reboot after each firmware upgrade is complete, stopping all relevant services each time the HN comes back up.
  11. As root (su -) on the HN, install the driver:
    1. The driver download link should go to the README. From the README name and location and by browsing the ftp server, you can extrapolate the location of the tarball, e.g.:
      wget "ftp://ftp.us.dell.com/SAS-RAID/megaraid_sas-v00.00.03.21-4-R193772.tar.gz"
    2. Unpack the tarball:
      tar -zxvf megaraid_sas-v00.00.03.21-4-R193772.tar.gz
    3. Print the current status:
      modinfo megaraid_sas
    4. Install the appropriate rpms:
      rpm -ivh dkms-2.0.19-1.noarch.rpm
      rpm -ivh megaraid_sas-v00.00.03.21-4.noarch.rpm
    5. Print the new status (output should have changed):
      modinfo megaraid_sas
      dkms status
    6. Reboot the HN:
      reboot
  12. Reboot all the WNs, as they may have difficulties accessing the network mounted files on the HN:
    ssh-agent $SHELL
    ssh-add
    cluster-fork "reboot"
  13. Be sure to restart the PhEDEx services after WN reboot.

 


 

Configure the big disk array

It is recommended but not required that the RAID firmware and drivers be updated prior to configuring the disk array. We chose to use LVM2 on a single partition for the large data array. This will allow for future expansion and simple repartitioning as the need arises. While it is possible to use 'fdisk' to partition the array, it is not advisable, as 'fdisk' does not play nicely with LVM and our total volume size exceeds the 2TB limit. It is also possible to create several smaller partitions and group them together with the 'vgcreate' command, but we considered that solution to be overly complicated. We also used the XFS disk format as it is optimized for large disks and works well with Bestman.

Create, format & mount the disk array on the GN:

As root (su -) on the grid node:

  1. Install XFS:
    rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/kernel-module-xfs-2.6.9-55.ELsmp-0.4-1.x86_64.rpm"
    rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/xfsprogs-2.6.13-1.SL.x86_64.rpm"
  2. Identify the array's hardware designation with fdisk:
    fdisk -l
    Our disk array is currently /dev/sdc.
  3. Use GNU Parted to create the partition:
    parted /dev/sdc
    At the parted command prompt:
    mklabel gpt - This changes the partition label to type GUID Partition Table.
    mkpart primary 0 9293440M - This creates a primary partition which starts at 0 and ends at 9293440MB.
    print - This confirms the creation of our new partition; output should look similar to:

    Disk geometry for /dev/sdc: 0.000-9293440.000 megabytes
    Disk label type: gpt
    Minor Start End Filesystem Name Flags
    1 0.017 9293439.983


    quit
  4. Assign the physical volumes (PV) for a new LVM volume group (VG):
    pvcreate /dev/sdc1
  5. Create a new VG container for the PV. Our VG is named 'data' and contains one PV:
    vgcreate data /dev/sdc1
  6. Create the logical volume (LV) with a desired size. The command takes the form:
    lvcreate -L (size in KB,MB,GB,TB,etc) (VG name)
    So, in our case:
    lvcreate -L 9293440MB data
    On this command, we receive the error message: Insufficient free extents (2323359) in volume group data: 2323360 required. Sometimes, it is simpler to enter the value in extents (the smallest logical units LVM uses to manage volume space). We will use a '-l' instead of '-L':
    lvcreate -l 2323359 data
  7. Confirm the LV details:
    vgdisplay
    The output should look like:
    --- Volume group ---
    VG Name               data
    System ID
    Format                lvm2
    Metadata Areas        1
    Metadata Sequence No  2
    VG Access             read/write
    VG Status             resizable
    MAX LV                0
    Cur LV                1
    Open LV               0
    Max PV                0
    Cur PV                1
    Act PV                1
    VG Size               8.86 TB
    PE Size               4.00 MB
    Total PE              2323359
    Alloc PE / Size       2323359 / 8.86 TB
    Free  PE / Size       0 / 0
    VG UUID               tcg3eq-cG1z-czIn-7j5a-YVM1-MT70-sqKAUY        
  8. After these commands, the location of the volume is /dev/mapper/data-lvol0 (ascertain by examining the contents of /dev/mapper). Create a filesystem:
    mkfs.xfs /dev/mapper/data-lvol0
  9. Create a mount point, edit /etc/fstab, and mount the volume:
    mkdir /data
    Add the following line to /etc/fstab:
    /dev/mapper/data-lvol0 /data xfs defaults 1 2
    And mount:
    mount /data
  10. Confirm the volume and size:
    df -h
    Output should look like:
    /dev/mapper/data-lvol0 8.9T 528K 8.9T 1% /data
  11. Create subdirectories and set permissions:
    mkdir /data/se
    mkdir /data/se/store
    cd /
    ln -s /data/se/store
    mkdir /data/users
    For all currently existing users (a bulk loop sketch is given after this list):
    mkdir /data/users/username
    chown username:users /data/users/username
  12. We create an srm dropbox for our users to transfer files via the srm protocol:
    mkdir /data/users/srm-drop
    chown root:users /data/users/srm-drop
    chmod 775 /data/users/srm-drop
    We use a cron job to garbage collect this directory. Edit /var/spool/cron/root and add the line:
    49 02 * * * find /data/users/srm-drop -mtime +7 -type f -exec rm -f {} \;
    This will remove week-old files from /data/users/srm-drop every day at 2:49am.
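A bulk version of the per-user directory creation in step 11, to be run once /data is mounted on the HN (see the next section); this is a sketch that assumes every directory under /export/home corresponds to an account in the 'users' group:

for u in `ls /export/home`
do
  mkdir -p /data/users/$u
  chown $u:users /data/users/$u
done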

Network mount the disk array on all the nodes

These commands network-mount /data on all nodes. First, have the GN export /data. As root (su -) on the GN:

  1. Edit /etc/exports on the GN as root (su -):
    chmod +w /etc/exports
    Add this line to /etc/exports: /data 10.0.0.0/255.0.0.0(rw,async)
    chmod -w /etc/exports
  2. Restart the GN NFS service:
    /etc/init.d/nfs restart
  3. Have the NFS service start on the GN whenever it's rebooted:
    /sbin/chkconfig --add nfs
    chkconfig nfs on

Now have the HN mount /data and edit the Kickstart file to mount /data on all other nodes. As root (su -) on the HN:

  1. Edit /etc/fstab on the HN and tell it to get /data from the grid node:
    grid-0-0:/data /data nfs rw 0 0
  2. Have the HN mount /data and make the symlink:
    mkdir /data
    mount /data

    cd /
    ln -s /data/se/store
  3. Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and place the following commands inside the <post></post> brackets:
    <file name="/etc/fstab" mode="append">
    grid-0-0:/data /data nfs rw 0 0
    </file>
    mkdir /data
    mount /data

    cd /
    ln -s /data/se/store
    cd -

    Note that given the Rocks node inheritance structure, the grid node will also have its /etc/fstab file appended with this network mount if it's ever reinstalled. However, since reinstalling the grid node via Rocks Kickstart is highly undesirable anyway, we break the model here. If grid node reinstall is absolutely required, after reinstall, this line needs to be removed from the /etc/fstab file on the grid node and the logical volume line in the previous section needs to be used instead.
  4. Create the new distribution:
    cd /home/install
    rocks-dist dist
  5. Re-shoot the nodes following these instructions.



 

Instrument & monitor

These steps must be done after inserting the nodes into the Rocks database via insert-ethers. The Baseboard Management Controllers (BMCs) on the WNs will issue DHCP requests, which will confuse insert-ethers if you try to add both a node and its BMC at the same time. We configure the BMCs on the WNs to respond to manual ipmish calls from the HN. However, for automation, we opted to have every node self-monitor, so every node also installs Dell's OpenManage Server Administrator (OMSA). We configured the BMCs on the WNs and IPMI on the HN prior to installing OMSA, but BMC documentation suggests it should be possible to configure the BMCs via OMSA, so the steps we took to configure the BMCs, including rebooting each machine and changing the settings manually on every node, may not be required. We installed OMSA 5.5 on all nodes and OpenIPMI on the HN from the disk which came with our system, packaged with OpenManage 5.3. Dell has not released OpenIPMI specifically for OpenManage 5.5, but we have not experienced any version mismatches by using the older OpenIPMI client.

Install OpenIPMI on the HN:

At the RDC, start the OS GUI on the HN as root (startx). Insert the Dell OpenManage DVD into the HN drive (labelled Systems Management Tools and Documentation). Install Dell's management station software:

  1. Navigate to /media/cdrecorder/SYSMGMT/ManagementStation/linux/bmc
  2. Install:
    rpm -Uvh osabmcutil9g-RHEL-3.0-11.i386.rpm
  3. Navigate to /media/cdrecorder/SYSMGMT/ManagementStation/linux/bmc/ipmitool/RHEL4_x86_64
  4. Install:
    rpm -Uvh *rpm
  5. Start OpenIPMI:
    /etc/init.d/ipmi start

Configure the WN & IN BMCs:

To configure the BMCs to respond to ipmish command-line calls from the HN, reboot each node and configure the BIOS and remote access settings.

At boot time, press F2 to enter the BIOS configuration. Set the following:

Enter the remote access setup shortly after BIOS boot by typing Ctrl-E. Set the following:

Before exiting the remote access setup, or as soon as possible afterwards, tell the HN to listen for DHCP requests coming from the BMC. As root (su -) on the HN:

  1. insert-ethers
  2. Select Remote Management
  3. After Rocks recognizes the BMC, exit with the F11 key.

You may need to reboot the WN to get all the new settings to work. To test that it worked, execute from the HN:

ipmish -ip manager-x-y -u ... -p ... sysinfo

Install & configure OMSA on the HN & GN

Install Dell OpenManage Server Administrator (repeat for the HN & GN):

  1. Set up the environment:
    mkdir /share/apps/OpenManage-5.5
    cd /share/apps/OpenManage-5.5

  2. Download OMSA:
    wget "http://ftp.us.dell.com/sysman/OM_5.5.0_ManNode_A00.tar.gz"
    tar -xzvf OM_5.5.0_ManNode_A00.tar.gz
  3. Fool OpenManage into thinking we have a valid OS (which we do):
    echo Nahant >> /etc/redhat-release
  4. Install OMSA:
    cd linux/supportscripts
    ./srvadmin-install.sh

    Choose "Install all"
  5. Start OMSA:
    srvadmin-services.sh start
  6. Check it's running and reporting:
    omreport system summary
    Navigate to https://hepcms-hn.umd.edu:1311
  7. The files created from unpacking the tarball can be deleted if desired; they were for installation purposes only.

Create the executables that will be called in the event of OMSA-detected warnings and failures. We issue notifications via email, including cell phone email addresses (which can be looked up on your cell phone provider's website):

  1. Create /share/apps/OpenManage-5.5/warningMail.sh:
    #!/bin/sh
    echo "Dell OpenManage has issued a warning on" `hostname` > /tmp/OMwarning.txt
    echo "If HN: https://hepcms-hn.umd.edu:1311" >> /tmp/OMwarning.txt
    echo "If WN: use ipmish from HN or omreport from WN" >> /tmp/OMwarning.txt
    mail -s "hepcms warning" email1@domain1.com email2@domain2.net </tmp/OMwarning.txt>/share/apps/OpenManage-5.5/warningMailFailed.txt 2>&1
  2. Create /share/apps/OpenManage-5.5/failureMail.sh:
    #!/bin/sh
    echo "Dell OpenManage has issued a failure alert on" `hostname` > /tmp/OMfailure.txt
    echo "Immediate action may be required." >> /tmp/OMfailure.txt
    echo "If HN: https://hepcms-hn.umd.edu:1311" >> /tmp/OMfailure.txt
    echo "If WN: use ipmish from HN or omreport from WN" >> /tmp/OMfailure.txt
    mail -s "hepcms failure" email1@domain1.com email2@domain2.net </tmp/OMfailure.txt>/share/apps/OpenManage-5.5/failureMailFailed.txt 2>&1
  3. Make them executable and create the error log files:
    chmod +x /share/apps/OpenManage-5.5/warningMail.sh
    chmod +x /share/apps/OpenManage-5.5/failureMail.sh
    touch /share/apps/OpenManage-5.5/warningMailFailed.txt
    touch /share/apps/OpenManage-5.5/failureMailFailed.txt
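Before wiring the scripts into OMSA alerts, they can be exercised by hand to confirm that mail actually goes out:

/share/apps/OpenManage-5.5/warningMail.sh
cat /share/apps/OpenManage-5.5/warningMailFailed.txt

warningMailFailed.txt should remain empty if the mail command succeeded, and the test message should arrive at the addresses listed in the script.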

Configure OMSA to handle warnings and failures:

  1. Navigate to https://hepcms-hn.umd.edu:1311 and log in
  2. To configure the HN to automatically shutdown in the event of temperature warnings:
    1. Select the Shutdown tab and the "Thermal Shutdown" subtab
    2. Select the Warning option and click the "Apply Changes" button
  3. Under the "Alert Management" tab, set the desired warning alerts to execute application /share/apps/OpenManage-5.5/warningMail.sh.
  4. Under the "Alert Management" tab, we set the following failure alerts to execute application /share/apps/OpenManage-5.5/failureMail.sh.
  5. Repeat for the GN (https://hepcms-0.umd.edu:1311).

Install & configure OMSA on the WNs & INs:

We install and configure OMSA via Rocks Kickstart. As root (su -) on the HN:

  1. Place the appropriate installation files to be served from the HN:
    cd /home/install/contrib/4.3/x86_64/RPMS
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm"
    wget "http://ftp.us.dell.com/sysman/OM_5.5.0_ManNode_A00.tar.gz"
  2. Add the text in this xml fragment to the <post></post> section of /home/install/site-profiles/4.3/nodes/extend-compute.xml. If you are performing the OMSA install manually from the command line, you can reference the text in the xml fragment to see the commands executed to perform the install. The xml fragment is effectively a shell script, with & characters replaced by &amp; and > by &gt; .
  3. Create the new Kickstart:
    cd /home/install
    rocks-dist dist
  4. Reinstall all the WNs & INs.
  5. The OMSA install cannot be completed entirely in the Rocks Kickstart.
    1. Create a shell script which will complete the installation, /home/install/sbin/OMSAinstall.sh:
      cd /scratch/OpenManage-5.5/linux/supportscripts
      ./srvadmin-install.sh -b
      srvadmin-services.sh start &
    2. And a shell script which will configure OMSA, /home/install/sbin/OMSAconfigure.sh.
    3. Make them executable:
      chmod +x /home/install/sbin/OMSAinstall.sh
      chmod +x /home/install/sbin/OMSAconfigure.sh
    4. And execute them after every WN reinstall:
      ssh-agent $SHELL
      ssh-add
      cluster-fork "/home/install/sbin/OMSAinstall.sh

      cluster-fork "/home/install/sbin/OMSAconfigure.sh

 


 

Install CMSSW

Production releases of CMSSW can be installed automatically via OSG tools (email Bockjoo Kim to do so). Automatic installs require that you prepare your environment, install Squid, and map Bockjoo's grid certificate to the cmssoft account (see the OSG installation guide for details on how to do this with a grid-mapfile). The remaining instructions are for manual installations and are taken from this guide.

Prepare the environment:

  1. Create a user specifically for CMSSW installs, whom we will call cmssoft, following the instructions for adding new users.
  2. As root (su -) on the grid node, create /scratch/cmssw and cede control to cmssoft:
    mkdir /scratch/cmssw
    chown -R cmssoft:users /scratch/cmssw

    Prepare it to be network mounted by editing /etc/exports and adding the line:
    /scratch 10.0.0.0/255.0.0.0(rw,async)
  3. As root (su -) on the head node, network mount /scratch on the grid node as /sharesoft on all nodes:
    1. Create /etc/auto.sharesoft file with the content:
      cmssw grid-0-0.local:/scratch/cmssw
      And change the permissions:
      chmod 444 /etc/auto.sharesoft
    2. Edit /etc/auto.master:
      chmod 744 /etc/auto.master
      Add the line: /sharesoft /etc/auto.sharesoft --timeout=1200

      chmod 444 /etc/auto.master
    3. Inform 411, the Rocks information service, of the change:
      cd /var/411
      make clean
      make
  4. Once /etc/auto.sharesoft has propagated to all the nodes from 411, restart the NFS services on the grid node. As root (su -) on the grid node:
    /etc/rc.d/init.d/nfs restart
    /etc/rc.d/init.d/portmap restart
    service autofs reload

    If the NFS service on the GN doesn't already start on reboot, configure that now:
    /sbin/chkconfig --add nfs
    chkconfig nfs on

  5. Tell WNs to restart their own auto-NFS service. As root (su -) on the head node:
    ssh-agent $SHELL
    ssh-add
    cluster-fork '/etc/rc.d/init.d/autofs restart'
    Note: Some directory restarts may fail because they are in use. However, /sharesoft should get mounted regardless.
  6. As cmssoft on the grid node (su - cmssoft), prepare for CMSSW installation following these instructions. Some notes:
    1. Set the correct permissions first:
      chmod 755 /scratch/cmssw
    2. We use the VO_CMS_SW_DIR environment variable, as we later set up a link in the OSG app directory that points to this directory:
      setenv VO_CMS_SW_DIR /sharesoft/cmssw
      It's important that this environment variable points to the network mount.
    3. We use the same SCRAM_ARCH as in the Twiki, e.g.:
      setenv SCRAM_ARCH slc4_ia32_gcc345
    4. You can tail -f the log file to watch the install and check if the bootstrap was successful or to see any errors.
  7. We want all users to source the CMSSW environment on login according to these instructions. By placing the source commands in the .cshrc & .bashrc skeleton files, all new users will have the source inside their .cshrc & .bashrc files. Existing users will have to add this manually. As root (su -) on the HN, edit /etc/skel/.cshrc to include the lines:
    # CMSSW
    setenv VO_CMS_SW_DIR /sharesoft/cmssw
    source $VO_CMS_SW_DIR/cmsset_default.csh
    Similarly, edit /etc/skel/.bashrc:
    # CMSSW
    export VO_CMS_SW_DIR=/sharesoft/cmssw
    . $VO_CMS_SW_DIR/cmsset_default.sh
  8. If OSG has been installed (instructions below are repeated under OSG installation):
    1. Inform BDII that we have the slc4_ia32_gcc345 environment. Edit /sharesoft/osg/app/etc/grid3-locations.txt to include the lines:
      VO-cms-slc4_ia32_gcc345 slc4_ia32_gcc345 /sharesoft/cmssw
      VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
      (modify X_Y_Z and add a new line for each release of CMSSW installed)
    2. Create a link to CMSSW in the OSG app directory (set during OSG CE configuration inside config.ini):
      cd /sharesoft/osg/app
      mkdir cmssoft
      ln -s /sharesoft/cmssw cmssoft/cms

Install Squid

The conditions database is managed by Frontier, which requires a Squid web proxy to be installed. We choose to install it on the HN. These instructions are based on these two (1, 2) Squid for CMS guides; be sure to check them for the most recent details.

As root (su -) on the HN:

  1. First create the Frontier user and give it ownership of the Squid installation and cache directory. As root (su -) on the HN:
    useradd -c "Frontier Squid" -n dbfrontier -s /bin/bash
    passwd dbfrontier
    ssh-agent $SHELL
    ssh-add
    rocks sync config
    rocks sync users
    mkdir /scratch/squid
    chown dbfrontier:users /scratch/squid
  2. Login as the Frontier user (su - dbfrontier).
  3. Download and unpack Squid for Frontier (check this link for the latest version):
    wget "http://frontier.cern.ch/dist/frontier_squid-4.0rc9.tar.gz"
    tar -xvzf frontier_squid-4.0rc9.tar.gz
    cd frontier_squid-4.0rc9
  4. Configure Squid by calling the configuration script:
    ./configure
    providing the following answers:
    1. Installation directory: /scratch/squid
    2. Network & netmask: 128.8.164.0/255.255.255.192 10.0.0.0/255.0.0.0
    3. Cache RAM (MB): 256
    4. Cache disk (MB): 5000
  5. Install:
    make
    make install
  6. Start the Squid server:
    /scratch/squid/frontier-cache/utils/bin/fn-local-squid.sh start
  7. You can start the Squid server at boot time. As root (su -):
    cp /scratch/squid/frontier-cache/utils/init.d/frontier-squid.sh /etc/init.d/.
    /sbin/chkconfig --add frontier-squid.sh
  8. Create a cron job to rotate the logs:
    crontab /scratch/squid/frontier-cache/utils/cron/crontab.dat
  9. We choose to restrict Squid access to CMS Frontier queries, since the IPs allowed by Squid include addresses not in our cluster. Edit /scratch/squid/frontier-cache/squid/etc/squid.conf and add the line:
    http_access deny !CMSFRONTIER
    which should be placed immediately before the line:
    http_access allow NET_LOCAL
    Then tell Squid to use the new configuration:
    /scratch/squid/frontier-cache/squid/sbin/squid -k reconfigure
  10. Test Squid with Frontier
  11. Register your server

To provide new configuration options, call make clean before make to get a fresh install. Be sure to stop the Squid server first (/scratch/squid/frontier-cache/utils/bin/fn-local-squid.sh stop).
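Putting those pieces together, a full reconfiguration pass looks roughly like this (as dbfrontier, assuming the tarball was unpacked in dbfrontier's home directory as above):

/scratch/squid/frontier-cache/utils/bin/fn-local-squid.sh stop
cd ~/frontier_squid-4.0rc9
make clean
./configure
make
make install
/scratch/squid/frontier-cache/utils/bin/fn-local-squid.sh start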

We create the site-local-config.xml and storage.xml files as a part of the PhEDEx installation, but they can be created right away. site-local-config.xml should be stored in /sharesoft/cmssw/SITECONF/T3_US_UMD/JobConfig and /sharesoft/cmssw/SITECONF/local/JobConfig while storage.xml should be in /sharesoft/cmssw/SITECONF/T3_US_UMD/PhEDEx and /sharesoft/cmssw/SITECONF/local/PhEDEx. Links provided as a part of the PhEDEx instructions:

Install a CMSSW release:

  1. Login as cmssoft to the GN.
  2. The available CMSSW releases can be listed by:
    apt-cache search cmssw | grep CMSSW
  3. Follow these instructions, some notes:
    1. Be sure to set VO_CMS_SW_DIR & SCRAM_ARCH, get the environment, and update:
      setenv VO_CMS_SW_DIR /sharesoft/cmssw
      setenv SCRAM_ARCH slc4_ia32_gcc345
      source $VO_CMS_SW_DIR/$SCRAM_ARCH/external/apt/<apt-version>/etc/profile.d/init.csh
      apt-get update
    2. RPM style options can be specified with syntax such as:
      apt-get -o RPM::Install-Options::="--ignoresize" install cms+cmssw+CMSSW_X_Y_Z
    3. This process takes about an hour, depending on the quantity of data you'll need to download.
    4. You can safely ignore the message "find: python: No such file or directory"
  4. If OSG has been installed:
    1. Inform BDII that this release of CMSSW is available. As root (su -), edit /sharesoft/osg/app/etc/grid3-locations.txt to include the line:
      VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
    2. Edit the grid policy and home page and add the version installed.

Uninstall a CMSSW release

  1. Login as cmssoft to the HN.
  2. List the currently installed CMSSW versions:
    scramv1 list | grep CMSSW
  3. If OSG has been installed:
    1. Add a link to the CMSSW installation in the osg-app directory:
      cd /sharesoft/osg/app
      mkdir cmssoft
      ln -s /sharesoft/cmssw cmssoft/cms
    2. Inform BDII that this release of CMSSW is no longer available. As root (su -), edit /sharesoft/osg/app/etc/grid3-locations.txt and remove the line:
      VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
    3. Edit the grid policy page to remove the version and the home page to announce its removal.
  4. Remove a CMSSW release:
    apt-get remove cms+cmssw+CMSSW_X_Y_Z

 


 

Install CRAB

We install CRAB with gLite-UI on the interactive nodes only. We've had problems trying to install gLite-UI via yum on the Rocks HN and are told we shouldn't install it on the OSG CE or SE. Some people have reported no issues with the gLite-UI tarball when they don't install it as root. gLite-UI is necessary to use the glite scheduler in users' crab.cfg, which allows users to submit directly to EGEE (European) sites. Alternatively, gLite-UI does not have to be installed if users set scheduler=condor_g in their crab.cfg and white list the site they wish to submit to. Additionally, the glidein scheduler can be used by CRAB to submit to any Condor GlideIn enabled CrabServer, such as the one at UCSD, which can then send the job on to any OSG or EGEE CMS site. GlideIn comes with Condor, so you do not have to install gLite-UI to get it.

We install CRAB on our INs using a specially created Rocks appliance. Instructions below are for command-line installs and are adapted from four (1, 2, 3, 4) gLite guides, this YAIM guide, and this CRAB guide.

On the installation node as root (su -):

  1. If you do not already have certificates in /etc/grid-security/certificates, you'll need to download and install the lcg-CA yum repo:
    cd /etc/yum.repos.d
    wget "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo"

    yum install lcg-CA
  2. Navigate to the gLite-UI tarball repository and select your desired version of gLite-UI. These instructions are for 3.1.28-0, though they can be adapted for other releases. Download the tarballs:
    mkdir /scratch/gLite
    cd /scratch/gLite
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0.tar.gz"
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0-external.tar.gz"
    mkdir glite-UI-3.1.28-0
    ln -s glite-UI-3.1.28-0 gLite-UI
    cd gLite-UI
    tar zxvf ../glite-UI-3.1.28-0.tar.gz
    tar zxvf ../glite-UI-3.1.28-0-external.tar.gz
  3. Make your site-info.def file following these (1, 2) instructions. We place our site-info.def in /scratch/gLite/gLite-UI.
  4. Call YAIM to install and configure gLite-UI using your site-info.def file:
    ./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR
  5. gLite-UI has problems with its PYTHONPATH. Edit /scratch/gLite/gLite-UI/external/etc/profile.d/grid-env.sh and add inside the if block:
    gridpath_append "PYTHONPATH" "/scratch/gLite/gLite-UI/glite/lib"
    gridpath_append "PYTHONPATH" "/scratch/gLite/gLite-UI/lcg/lib"
  6. Navigate to the CRAB download page and select your desired version of CRAB. These instructions are for 2_6_1, though they can be adapted for other releases. Download, install, and configure:
    mkdir /scratch/crab
    cd /scratch/crab
    wget --no-check-certificate "http://cmsdoc.cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_6_1.tgz"
    tar -xzvf CRAB_2_6_1.tgz
    ln -s CRAB_2_6_1 current
    cd CRAB_2_6_1
    ./configure

User instructions for getting the gLite-UI & CRAB environment are here.

 


 

Install OSG

These instructions assume you have already installed Pacman, have a personal grid certificate, and have network mounted the big disk array to be used as the SE. The OSG installation and configuration is based on this OSG guide. OSG is built on top of services provided by VDT, so VDT documentation may be helpful to you. These instructions are for OSG 1.2 (our OSG 0.8, OSG 1.0 archives).

We install the worker-node client, the CE, and SE all on the same node (the grid node) and the CE & SE in the same directory. Therefore, we make some configuration choices along the way which might not be applicable for all sites.

Request host certificates:

Follow these instructions. Some notes:

  1. Our full hostname for our grid node is hepcms-0.umd.edu
  2. Enter osg as the registration authority
  3. Enter cms as our virtual organization (VO)
  4. Be sure to run the second request for the http certificate
  5. We make a third request for an rsv certificate. Since we're going to give the rsvuser ownership of the cert, create the user account now. As root (su -) on the HN:
    useradd -c "RSV monitoring user" -n rsvuser
    passwd rsvuser

    ssh-agent $SHELL
    ssh-add
    rocks sync config
    rocks sync users

  6. Once you've received email confirmation that your certificates are approved and you've followed the instructions to retrieve your certificates, copy the files to the appropriate directories on the GN and give them the needed ownerships:
    mkdir -p /etc/grid-security/http
    cp hepcms-0cert.pem /etc/grid-security/hostcert.pem

    cp hepcms-0key.pem /etc/grid-security/hostkey.pem
    cp hepcms-0cert.pem /etc/grid-security/containercert.pem
    cp hepcms-0key.pem /etc/grid-security/containerkey.pem
    cp http-hepcms-0cert.pem /etc/grid-security/http/httpcert.pem
    cp http-hepcms-0key.pem /etc/grid-security/http/httpkey.pem
    cp rsv-hepcms-0cert.pem /etc/grid-security/rsvcert.pem
    cp rsv-hepcms-0key.pem /etc/grid-security/rsvkey.pem
    chown daemon:daemon /etc/grid-security/containercert.pem
    chown daemon:daemon /etc/grid-security/containerkey.pem
    chown daemon:daemon /etc/grid-security/http/httpcert.pem
    chown daemon:daemon /etc/grid-security/http/httpkey.pem
    chown rsvuser:users /etc/grid-security/rsvcert.pem
    chown rsvuser:users /etc/grid-security/rsvkey.pem

Install and configure the CE, BeStMan, and the WN client

These instructions assume /sharesoft has already been network mounted from the grid node /scratch directory. If it hasn't, instructions under the CMSSW installation give the needed steps.

Prepare the environment: First we need to prepare for the install by creating the appropriate directories, network mounting, and changing our hostname.

  1. Create the appropriate directories. As root on the GN (su -):
    mkdir /scratch/osg
    cd /scratch/osg
    mkdir wnclient-1.2 ce-1.2
    ln -s wnclient-1.2 wnclient
    ln -s ce-1.2 ce
    ln -s ce-1.2 se

    mkdir -p app/etc
    chmod 777 app app/etc
    mkdir /data/se/osg
    chown root:users /data/se/osg
    chmod 775 /data/se/osg
  2. Have all nodes (including the GN) mount /scratch/osg on the GN as /sharesoft/osg. Edit /etc/auto.sharesoft on the HN as root (su -) and add the line:
    osg grid-0-0.local:/scratch/osg
  3. We use /tmp on the WNs as the temporary working directory for OSG jobs. If you haven't done so already, configure cron to garbage collect /tmp on all of the nodes (a sample cron entry is given after this list).
  4. On a Rocks appliance, the command hostname outputs the local name (in our case, grid-0-0) instead of the FQHN. OSG needs hostname to output the FQHN, so we modify our configuration such that hostname prints hepcms-0.umd.edu following these instructions. Specifically:
    1. In /etc/sysconfig/network, replace:
      HOSTNAME=grid-0-0.local
      with
      HOSTNAME=hepcms-0.umd.edu
    2. In /etc/hosts, add:
      128.8.164.12 hepcms-0.umd.edu
    3. Then tell hostname to print the true FQHN:
      hostname hepcms-0.umd.edu
    4. And restart the network:
      service network restart
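
As a quick check (not part of the original instructions) that the hostname change took effect, still on the GN:

hostname                      # should now print hepcms-0.umd.edu
grep hepcms-0 /etc/hosts      # should show the 128.8.164.12 entry added above
ping -c 1 hepcms-0.umd.edu    # should resolve to 128.8.164.12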

Install the compute element: Install the CE following these instructions. Some notes:

  1. We install in /sharesoft/osg/ce:
    cd /sharesoft/osg/ce
  2. The pacman CE install:
    pacman -get http://software.grid.iu.edu/osg-1.2:ce
    outputs the messages:
    INFO: The Globus-Base-Info-Server package is not supported on this platform
    INFO: The Globus-Base-Info-Client package is not supported on this platform

    which are safe to ignore.
  3. We use our existing Condor installation as our jobmanager, so execute:
    . setup.sh
    export VDTSETUP_CONDOR_LOCATION=/opt/condor
    pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:Globus-Condor-Setup
  4. We also use ManagedFork:
    pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:ManagedFork
    $VDT_LOCATION/vdt/setup/configure_globus_gatekeeper --managed-fork y --server y
  5. Since we run our CE & SE on the same node and various CMS utilities assume the SE is on port 8443, we need to change the ports that some CE services run on.
    1. Replace 8443 in $VDT_LOCATION/tomcat/v55/conf/server.xml with 7443. The line:
      enableLookups="false" redirectPort="8443" protocol="AJP/1.3"
      should become:
      enableLookups="false" redirectPort="7443" protocol="AJP/1.3"
    2. Edit the file $VDT_LOCATION/apache/conf/extra/httpd-ssl.conf to change port 8443 to port 7443. The lines:
      Listen 8443
      RewriteRule (.*) https://%{SERVER_NAME}:8443$1
      <VirtualHost _default_:8443>
      ServerName www.example.com:8443

      should become:
      Listen 7443
      RewriteRule (.*) https://%{SERVER_NAME}:7443$1

      <VirtualHost _default_:7443>
      ServerName www.example.com:7443
  6. Don't forget to run the post install:
    vdt-post-install
  7. We download certs to the local directory, which is network mounted and so readable by all nodes in the cluster:
    vdt-ca-manage setupca --location local --url osg
    The local directory /etc/grid-security/certificates on all nodes which need access to certs should point to the CE $VDT_LOCATION/globus/share/certificates. E.g., as root (su -) on the GN (needed by OSG services) and interactive nodes (needed by CRAB):
    mkdir /etc/grid-security
    cd /etc/grid-security
    ln -s /sharesoft/osg/ce/globus/share/certificates
    The WNs will get certificates by following the symlinks we create in the wnclient directory (installation instructions for WN client below). They do not assume that certificates are at /etc/grid-security/certificates.
  8. *Note: This step may no longer be necessary in OSG 1.2. RSV needs to run in the condor-cron queue instead of the global condor pool because it has many lightweight jobs running constantly. Edit ~rsvuser/.cshrc and add:
    source /sharesoft/osg/ce/setup.csh
    source $VDT_LOCATION/vdt/etc/condor-cron-env.csh

    and edit ~rsvuser/.bashrc and add:
    . /sharesoft/osg/ce/setup.sh
    . $VDT_LOCATION/vdt/etc/condor-cron-env.sh

Configure the CE: Configure the CE following these instructions. Our config.ini is available here for reference. Note that in OSG 1.2, config.ini is placed in the $VDT_LOCATION/osg/etc directory instead of $VDT_LOCATION/monitoring.

Get the OSG environment: We also have users get the OSG environment on login by editing the .bashrc & .cshrc skeleton files. These will be copied to each new user's /home directory. Existing users (such as cmssoft) will have to add the source commands to their ~/.bashrc & ~/.cshrc files. As root (su -) on the HN:

  1. Add to /etc/skel/.bashrc:
    . /sharesoft/osg/ce/setup.sh
  2. Add to /etc/skel/.cshrc:
    source /sharesoft/osg/ce/setup.csh

Configure the grid-mapfile service: We use a grid-mapfile for user authentication. OSG strongly recommends the use of GUMS; however, we encountered great difficulty running GUMS on our Rocks HN. Follow these instructions to configure the grid-mapfile service. Some notes:

  1. The sudo-example.txt file is located in $VDT_LOCATION/osg/etc.
  2. To edit /etc/sudoers:
    visudo
    a

    Copy and paste changes, being careful to replace symlinks with full paths.
    Esc
    :wq!

  3. The VOs we support can be limited by editing the file $VDT_LOCATION/edg/etc/edg-mkgridmap.conf and removing all lines but those for the mis, uscms01, and ops users. This file can be overwritten on future pacman updates, so check it each time.
  4. The accounts for each supported VO need to be made. On the HN as root (su -):
    useradd -c "Monitoring information service" -n mis -s /bin/true
    useradd -c "CMS grid jobs" -n uscms01 -s /bin/true
    useradd -c "Monitoring from ops" -n ops -s /bin/true
    ssh-agent $SHELL
    ssh-add
    rocks sync config
    rocks sync users

    Setting their shell to /bin/true is a security measure, as these accounts should never be used for an actual ssh login.
  5. The grid-mapfile can be remade at any time by executing (a quick check of the generated file is sketched after this list):
    $VDT_LOCATION/edg/sbin/edg-mkgridmap
  6. The http cert will be used by the CE to gather information. It needs to be mapped to a user account following these instructions. The DN->user mapping we add to our grid-mapfile-local is:
    "/DC=org/DC=doegrids/OU=Services/CN=http/hepcms-0.umd.edu" uscms01
  7. The RSV cert will also be used:
    "/DC=org/DC=doegrids/OU=Services/CN=rsv/hepcms-0.umd.edu" rsvuser
  8. If the CMSSW environment is ready and you wish to have Bockjoo perform automatic installs, map his DNs to the cmssoft account:
    "/DC=org/DC=doegrids/OU=People/CN=Bockjoo Kim (UFlorida T2 Service) 606361" cmssoft
    "/DC=org/DC=doegrids/OU=People/CN=Bockjoo Kim 740786" cmssoft

    (these DNs were found by doing a grep on the existing grid-mapfile)
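
After edg-mkgridmap has run, a quick way to confirm that the mappings above were picked up; we assume the generated file is at the VDT default location /etc/grid-security/grid-mapfile (adjust the path if your configuration differs):

grep uscms01 /etc/grid-security/grid-mapfile | head
grep rsvuser /etc/grid-security/grid-mapfile
grep cmssoft /etc/grid-security/grid-mapfile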

Install & configure the storage element: Install BeStMan-Gateway following these instructions. Some notes:

  1. We install in our /sharesoft/osg/se directory, which is a symlink to our ce installation directory. If you're working in a fresh shell, be sure to source the existing OSG installation:
    cd /sharesoft/osg/se
    . setup.sh
  2. We use the following configuration settings:
    vdt/setup/configure_bestman --server y \
    --user daemon \
    --cert /etc/grid-security/containercert.pem \
    --key /etc/grid-security/containerkey.pem \
    --http-port 7070 \
    --https-port 8443 \
    --globus-tcp-port-range 20000,25000 \
    --enable-gateway \
    --with-allowed-paths "/tmp;/home;/data" \
    --with-transfer-servers gsiftp://hepcms-0.umd.edu

    If you call configure_bestman more than once, it will issue the message:
    find: /sharesoft/osg/se-1.2/bestman/bin/sharesoft/osg/se-1.2/bestman/sbin/sharesoft/osg/se-1.2/bestman/setup: No such file or directory
    This message can be safely ignored.
  3. Don't forget to edit the sudoers file to give daemon needed permissions:
    visudo
    a
    Copy and paste the needed lines
    Esc
    :wq!
  4. The certificate updater service is already configured to run via the CE, so we don't need to take any special steps for the SE. This is because we installed the SE on the same node and in the same directory as the CE.
  5. We use the gsiftp server running via the CE software, so don't need any special configuration options for the SE. This is because we installed the SE on the same node as the CE.

Install the worker node client: Now install the worker-node client as root (su -) on the GN in a fresh shell in /sharesoft/osg/wnclient following these instructions. Some notes:

  1. Because we install the WN client on the same network mount as the CE, we have the CE handle certificates. This is option 2 in the Twiki.
  2. The WN client documentation on the OSG ReleaseDocumentation Twiki is out of date as of August 16, 2009, so complete instructions are presented here:
    1. Install:
      cd /sharesoft/osg/wnclient
      pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:wn-client
      You can safely ignore the message:
      INFO: The Globus-Base-Info-Client package is not supported on this platform
    2. Get the WN client environment:
      . setup.sh
    3. Tell the WN client that we will store certificates in the local directory, specifically /sharesoft/osg/wnclient/globus/TRUSTED_CA:
      vdt-ca-manage setupca --location local --url osg
    4. Since we will run our CE on the same node, point the WN client TRUSTED_CA directory to the CE TRUSTED_CA directory:
      rm globus/TRUSTED_CA
      ln -s /sharesoft/osg/ce/globus/TRUSTED_CA globus/TRUSTED_CA
    5. The original WN client certificate directory can be removed if desired:
      rm globus/share/certificates
      rm -r globus/share/certificates-1.9
  3. Since the WN client is on the same node as the CE, no services need to be enabled or turned on. It is purely a passive software directory from which WNs can grab binaries and configuration.

Start the CE & SE

As root (su -) on the GN:

  1. Start the OSG CE & SE:
    cd /sharesoft/osg/ce
    . setup.sh
    vdt-control --on

    This starts all the services for both the CE & SE because we installed them in the same directory.
  2. You can perform a series of simple tests to see if your CE has basic functionality. Login to any user account and:
    source /sharesoft/osg/ce/setup.csh
    grid-proxy-init
    cd /sharesoft/osg/ce/verify
    ./site_verify.pl
  3. The CEmon log is kept at $VDT_LOCATION/glite/var/log/glite-ce-monitor.log.
  4. The GIP logs are kept at $VDT_LOCATION/gip/var/logs.
  5. globus & gridftp logs are kept in $GLOBUS_LOCATION/var and $GLOBUS_LOCATION/var/log.
  6. The BeStMan log is kept in $VDT_LOCATION/vdt-app-data/bestman/logs/event.srm.log.
  7. Results of the RSV probes will be visible at https://hepcms-0.umd.edu:7443/rsv in 15-30 mins. Further information can be found in the CE $VDT_LOCATION/osg-rsv/logs/probes.
  8. You can force RSV probes to run immediately following these instructions.
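
As an additional sanity check (not part of the OSG instructions), you can confirm that the main services are listening on the expected ports; 2119 (GRAM gatekeeper) and 2811 (GridFTP) are the standard defaults, 7443 is the Tomcat port we moved, and 8443 is BeStMan:

netstat -tlnp | egrep ':(2119|2811|7443|8443) '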

After starting the CE for the first time, the file /sharesoft/osg/app/etc/grid3-locations.txt is made. This file is used to publish VO software tags and should be edited every time a new VO software release is installed or removed. If CMSSW is installed (instructions below are repeated in the CMSSW installation):

  1. Add a link to the CMSSW installation in the osg-app directory:
    cd /sharesoft/osg/app
    mkdir cmssoft
    chmod 777 cmssoft
    chown cmssoft:users cmssoft
  2. Give cmssoft ownership of the release file:
    chown cmssoft:users /sharesoft/osg/app/etc/grid3-locations.txt
  3. As cmssoft (su - cmssoft), create the needed symlink in the OSG APP directory to CMSSW:
    cd /sharesoft/osg/app/cmssoft
    ln -s /sharesoft/cmssw cms
  4. As cmssoft (su - cmssoft), inform BDII which versions of CMSSW are installed and that we have the slc4_ia32_gcc345 environment. Edit /sharesoft/osg/app/etc/grid3-locations.txt to include the lines:
    VO-cms-slc4_ia32_gcc345 slc4_ia32_gcc345 /sharesoft/cmssw
    VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
    (modify X_Y_Z and add a new line for each release of CMSSW installed)

Register with the Grid Operations Center (GOC):

This should be done only once per site (we have already done this). Registration is done at the OSG Information Management (OIM) web portal. Instructions for registration can be found here; you'll need to register yourself and a resource as a new site. We used an older registration process which is no longer in use, but for reference, here are the options we selected for resource registration:

Once registration has completed, monitoring info will be here.

 


 

Install PhEDEx

We configure PhEDEx to use srm calls directly instead of FTS. FTS is the service most commonly used by Tier 1 and Tier 2 sites because it tends to be more scalable. FTS requires gLite, which may conflict with an existing CRAB gLite-UI install, so be sure to install PhEDEx on a different node in that case. Regardless, our current installation of PhEDEx does not use gLite. We install PhEDEx on grid-0-0. These instructions are adapted from these (1, 2, 3, 4, 5) PhEDEx guides.

These instructions assume you have already done all the major tasks except for the CRAB install. Specifically, you need to have configured the big disk, created the grid node (via Rocks appliance) and configured its external network connection, and installed OSG. You will also need to have Kerberos configured, CVS installed and configured, and CMSSW installed.

Site registration

Site registration is done only once for a site. These instructions are based on this PhEDEx guide; be sure to consult it for the most recent details. You can register your site in SiteDB prior to OSG GOC registration; however, once OSG GOC registration is complete, you should change your SAM name to your OSG GOC name by filing a new Savannah ticket.

  1. Create a Savannah ticket with your user public key (usercert.pem) and with the information:
    1. Site name: UMD
    2. CMS name: T3_US_UMD
    3. SAM name: umd-cms (our OSG GOC registration name)
    4. City/Country: College Park, MD, USA
    5. Site tier: Tier 3
    6. SE host: hepcms-0.umd.edu
    7. SE kind: disk
    8. SE technology: BeStMan
    9. CE host: hepcms-0.umd.edu
    10. Associate T1: FNAL
    11. Grid type: OSG
    12. Data manager: Marguerite Tonjes
    13. PhEDEx contact: Marguerite Tonjes
    14. Site admin: Marguerite Tonjes
    15. Site executive: Nick Hadley
  2. Email the persons listed here and ask them to add our site to the PhEDEx database, including a link to the Savannah ticket (CERN phonebook).
  3. Once someone has responded to say UMD has been put into SiteDB, go to https://cmsweb.cern.ch/sitedb/sitedb/sitelist/
    1. Log in with your CERN hypernews user name and password
    2. Under Tier 3 centres, click on the T3_US_UMD link
    3. Click on "Edit site information" and specify OSG as our Grid Middleware, our site home page as http://hep-t3.physics.umd.edu and our site logo URL as http://hep-t3.physics.umd.edu/images/umd-logo.gif
    4. We can also add/edit user information by clicking on "Edit site contacts":
      1. Click on "edit" to edit an existing user's info
      2. Click on "Add a person with a hypernews account to site" to add someone new
      3. Then click on the first letter of the user's last name. Note that many users are listed by their middle name instead of their last.
      4. Find the user in the list, and click "edit"
      5. A new page will appear. Click on appropriate values ("Site Admin", "Data Manager",etc.) in the last row of the new page (for the Tier 3), and click "Edit these details" to save.
    5. Under Site Configuration, select "Edit site configuration":
      1. CE FQDN: hepcms-0.umd.edu
      2. SE FQDN: hepcms-0.umd.edu
      3. PhEDEx node: T3_US_UMD
      4. GOCDB ID: leave blank
      5. Install development CMSSW releases?: Do not check
      6. Site installs software manually?: Check

Install on the GN

These instructions are for PhEDEx 3.2.9, though they can be adapted for later releases.

Prepare for the PhEDEx install. On the HN as root (su -):

  1. Create the PhEDEx user:
    useradd -c "PhEDEx" -n phedex -s /bin/bash
    passwd phedex
    ssh-agent $SHELL
    ssh-add
    rocks sync config
    rocks sync users
  2. Change ownership of the directory on /data which PhEDEx will use:
    chown phedex:users /data/se/store
    chmod 775 /data/se/store
  3. And as root on the GN:
    mkdir /localsoft/phedex
    chown phedex:users /localsoft/phedex

As phedex (su - phedex) on the GN:

  1. Set up the environment:
    cd /localsoft/phedex
    mkdir 3.2.9
    ln -s 3.2.9 current
    cd 3.2.9
  2. Install PhEDEx following these instructions. Some notes:
    1. Get the CMSSW libraries in your environment before calling the bootstrap script:
      export VO_CMS_SW_DIR=/sharesoft/cmssw
      . $VO_CMS_SW_DIR/cmsset_default.sh

    2. We set myarch=slc4_amd64_gcc345
    3. apt-cache search won't work until after you call
      source $sw/$myarch/external/apt/*/etc/profile.d/init.sh
      which in turn only works after you have set the sw & myarch environment variables and have downloaded and executed the bootstrap script (a sketch is given after this list).
    4. We set version=3_2_9
    5. We use the srm client already installed and network mounted on the OSG CE (we tell PhEDEx to grab the environment in the ConfigPart.Common file).
    6. We use the JDK already installed and network mounted on the OSG CE. No special modifications to PhEDEx to use it were required.
  3. Configure PhEDEx following these (1, 2) instructions. Examples of site configuration can be found here. Our local site configuration can be found here. Some notes:
    1. Our site name is T3_US_UMD, so our configuration directories are
      $PHEDEX_BASE/SITECONF/T3_US_UMD/PhEDEx
      and
      $PHEDEX_BASE/SITECONF/T3_US_UMD/JobConfig
    2. We had to modify more than just storage.xml, so be sure to check all the files in the directories for differences from the default templates.
    3. The JobConfig directory is not actually needed by PhEDEx; it's needed by CMSSW. We choose to put it in our PhEDEx installation area as well (it's harmless).
    4. CMSSW jobs also need the files in your SITECONF directory. Copy the entire SITECONF directory to the $CMS_PATH directory:
      su -
      cp -r /localsoft/phedex/current/SITECONF /sharesoft/cmssw/.
      cp -r /sharesoft/cmssw/SITECONF/T3_US_UMD /sharesoft/cmssw/SITECONF/local
      chown -R cmssoft:users /sharesoft/cmssw/SITECONF
      logout

      Some sites use different storage.xml files in their $PHEDEX_BASE and $CMS_PATH directories to handle CRAB stage-out of files without a locally installed storage element. Since we have a storage element, ours are the same.
    5. After starting services (detailed in the next section) for the first time, you can test your storage.xml file by:
      cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
      eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`

      Test srmv2 mapping from LFN to PFN:
      /localsoft/phedex/current/sw/slc4_amd64_gcc345/cms/PHEDEX/PHEDEX_3_2_0/Utilities/TestCatalogue -c storage.xml -p srmv2 -L /store/testfile
      Test srmv2 mapping from PFN to LFN:
      /localsoft/phedex/current/sw/slc4_amd64_gcc345/cms/PHEDEX/PHEDEX_3_2_0/Utilities/TestCatalogue -c storage.xml -p srmv2 -P srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/data/se/store/testfile
      Other transfer types can be tested by changing the protocol tag srmv2 to direct, srm, or gsiftp and changing the PFN or LFN argument passed to match. PhEDEx services don't need to be running to do these tests, but the first time PhEDEx is started, it creates some of the needed directories for this test.
  4. Submit a Savannah ticket for a CVS space under /COMP/SITECONF named T3_US_UMD. Once you receive the space, upload your site configuration to CVS:
    /usr/kerberos/bin/kinit -5 username@CERN.CH
    cvs co COMP/SITECONF/T3_US_UMD
    cp -r /localsoft/phedex/current/SITECONF/T3_US_UMD/* COMP/SITECONF/T3_US_UMD/.
    cd COMP/SITECONF/T3_US_UMD
    cvs add PhEDEx
    cvs add PhEDEx/*
    cvs commit -R -m "T3_US_UMD PhEDEx site configuration" PhEDEx
  5. Once your initial registration request is satisfied, you will receive three emails titled "PhEDEx authentication role for Prod (Debug, Dev)/UMD." Copy and paste the commands in the email to the command line. Copy the text output for each into the file /localsoft/phedex/current/gridcert/DBParam. Each text output should look something like (exact values removed for security):
    Section Prod/UMD
    Interface Oracle
    Database db_not_shown_here
    AuthDBUsername user_not_shown_here
    AuthDBPassword LettersAndNumbersNotShownHere
    AuthRole role_not_shown_here
    AuthRolePassword LettersAndNumbersNotShownHere
    ConnectionLife 86400
    LogConnection on
    LogSQL off
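
Referring back to the apt note in step 2 above, the following is a minimal sketch of the environment needed before apt-cache search works; the sw value is an assumption based on our install area, and the search string is only an example:

export sw=/localsoft/phedex/current/sw
export myarch=slc4_amd64_gcc345
source $sw/$myarch/external/apt/*/etc/profile.d/init.sh
apt-cache search PHEDEX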

Get proxy & start services

After reboot of the grid node, the grid certificate and proxy should still be valid, but PhEDEx services aren't configured to start automatically. On the grid node:

  1. Copy your personal usercert.pem and userkey.pem grid certificate files into ~phedex/.globus and give the phedex user ownership:
    chown phedex:users ~phedex/.globus/*
  2. As phedex, create your grid proxy:
    voms-proxy-init -voms cms -hours 350 -out /localsoft/phedex/current/gridcert/proxy.cert
    Be sure to note when the proxy will expire and log on to renew it before then (a quick check of the remaining lifetime is sketched after this list). Some sites will not accept proxies older than a week, so if you have many links, you will probably need to renew your proxy every week.
  3. Now start the services. To be extra safe, each service should be started in a new shell, though in most cases, executing the following in sequence should be OK:
    1. Start the Dev service instance:
      cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
      eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Dev environ`
      /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Dev start

      This service can be stopped by changing the command start to stop.
    2. Start the Debug service instance:
      cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
      eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Debug environ`
      /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Debug start

      This service can be stopped by changing the command start to stop.
    3. Start the Prod service instance:
      cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
      eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
      /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Prod start

      This service can be stopped by changing the command start to stop.
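
Referring back to step 2 above, a quick way to check how much lifetime (in seconds) remains on the PhEDEx proxy; voms-proxy-info is part of the grid client tools sourced from the OSG CE:

voms-proxy-info -file /localsoft/phedex/current/gridcert/proxy.cert -timeleft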

Clean Logs:

PhEDEx does not clean up its own logs. The first time you start the PhEDEx services, it will create the log files. We use logrotate in cron to clean them monthly, as well as to retain two months of old logs. After starting PhEDEx services at least once on the phedex node:

    1. Create the backup directories:
      mkdir /localsoft/phedex/current/Dev_T3_US_UMD/logs/old
      mkdir /localsoft/phedex/current/Debug_T3_US_UMD/logs/old
      mkdir /localsoft/phedex/current/Prod_T3_US_UMD/logs/old
    2. Create the file /home/phedex/phedex.logrotate with the contents (this logrotate guide was helpful):
      rotate 2
      monthly
      olddir old
      nocompress

      /localsoft/phedex/current/Dev_T3_US_UMD/logs/* {
         prerotate
            cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
            eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Dev environ`
            /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Dev stop
         endscript
         postrotate
            cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
            eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Dev environ`
            /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Dev start
         endscript
      }

      /localsoft/phedex/current/Debug_T3_US_UMD/logs/* {
         prerotate
            cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
            eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Debug environ`
            /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Debug stop
         endscript
         postrotate
            cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
            eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Debug environ`
            /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Debug start
         endscript
      }

      /localsoft/phedex/current/Prod_T3_US_UMD/logs/* {
         prerotate
            cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
            eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
            /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Prod stop
         endscript
         postrotate
            cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
            eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
            /localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Prod start
         endscript
      }

    3. Run logrotate from the command line to check that it works:
      /usr/sbin/logrotate -f /home/phedex/phedex.logrotate -s /home/phedex/logrotate.state
    4. As root (su -), automate by editing /var/spool/cron/phedex and adding the line:
      52 01 * * 0 /usr/sbin/logrotate /home/phedex/phedex.logrotate -s /home/phedex/logrotate.state
      This directs logrotate to run every Sunday at 1:52 AM as the user phedex.
    5. Additionally, the Prod download-remove agent doesn't clean up its job logs. As root, edit /var/spool/cron/phedex and add the line:
      02 00 * * 0 find /localsoft/phedex/current/Prod_T3_US_UMD/state/download-remove/*log -mtime +7 -type f -exec rm -f {} \;

Commission links:

To download data using PhEDEx, a site must have a Production link originating from one of the nodes hosting the dataset. To create each link, sites must go through a LoadTest/link commissioning process. Our Production links to download to our site are listed here. These instructions are adapted from this Twiki.

    1. The first link you'll want to commission is from the T1_US_FNAL_Buffer. To commission from FNAL, send a request to begin the link commissioning process to hn-cms-ddt-tf@cern.ch. To commission links from other sites, contact the PhEDEx admins for that site as listed in SiteDB (requires Firefox). Ask them if a link is OK and if so, to please create a LoadTest.
    2. For non-FNAL sites, create a Savannah ticket requesting that the Debug link be made from the other site to T3_US_UMD. Select the data transfers category, set the severity as 3-Normal, the privacy as public and T3_US_UMD as the site.
    3. PhEDEx or originating-site admins may create the transfer request for you. If they do, follow the link in the PhEDEx transfer request email sent to you to approve the request. If they do not, create the transfer request yourself:
      1. Go to the PhEDEx LoadTest injection page and under the link "Show Options," click the "Nodes Shown" tab, then select the source node.
      2. Find T3_US_UMD in the "Destination node" column and copy the "Injection dataset" name.
      3. Create a transfer request and copy the dataset name into the "Data Items" box. Select T3_US_UMD as the destination. The DBS is typically LoadTest07, but some sites may create the subscription under LoadTest. You will receive an error if you select the wrong one - simply go back and select the other DBS. Leave the drop down menus as-is (replica, growing, low priority, non-custodial, undefined group). Enter as a comment something to the effect of "Commissioning link from T1_US_FNAL_Buffer to T3_US_UMD," then click the "Submit Request" button.
      4. As administrator for the site, you should be able to approve the request right away; simply select the "Approve" radio button and submit the change.
    4. Files created by load tests should be removed shortly after they are created.
      • To use a cron job that removes LoadTest files at regular intervals, login to the GN as root (su -), edit /var/spool/cron/root and add the lines:
        07 * * * * find /data/se/store/PhEDEx_LoadTest07 -mmin +180 -type f -exec rm -f {} \;
        37 * * * * find /data/se/store/PhEDEx_LoadTest07 -depth -type d -mmin +180 -exec rmdir --ignore-fail-on-non-empty {} \;

        This removes PhEDEx load test files older than three hours every hour at the 7th minute, and removes empty load test directories older than three hours at the 37th minute.
      • Or you can configure the Debug agent to delete files immediately after download. To do this, base your PhEDEx configuration on the T3_US_FNALXEN configuration.
    5. Once load tests have been successful at a rate of >5 MB/sec for one day, the link qualifies as commissioned and PhEDEx admins will create the Production link. If PhEDEx admins don't take note of the successful tests within a week, you can send a reminder to hn-cms-ddt-tf@cern.ch or reply to the Savannah ticket that the link passes commissioning criteria and that you'd like the Prod link to be created.

 


 

Install/configure other software

Software which must be usable by the worker nodes should be installed in the head node /export/apps directory, which is cross-mounted and visible on every node as /share/apps.
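
A quick way to verify the cross mount from a WN (compute-0-0 is just an example node name; run ssh-agent $SHELL and ssh-add first if you are prompted for a passphrase):

touch /export/apps/mount-test
ssh compute-0-0 'ls -l /share/apps/mount-test'
rm /export/apps/mount-test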

RPMforge:

RPMforge helps to resolve package dependencies when installing new software. It enables RPMforge repositories in smart, apt, yum, and up2date. We use yum. Packages are installed both on the HN and on the WNs, so RPMforge needs to be installed for both. These instructions are adapted from RPMforge and Rocks.

  1. To install RPMforge on the HN:
    cd /home/install/contrib/4.3/x86_64/RPMS
    wget "http://packages.sw.be/rpmforge-release/rpmforge-release-0.3.6-1.el4.rf.x86_64.rpm"
    rpm -Uhv rpmforge-release-0.3.6-1.el4.rf.x86_64.rpm
  2. To install RPMforge on the WNs:
    Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and add the following line:
    <package>rpmforge-release</package>
    Make a new Rocks kickstart distribution:
    cd /home/install
    rocks-dist dist
    Reinstall the WNs.

xemacs/emacs:

Rocks does not install xemacs on any nodes nor emacs on the WNs. The installation instructions below assume that you have installed RPMforge on the HN to resolve package dependencies. Instructions to install on the WNs are adapted from this Rocks guide. The interactive nodes and grid nodes install emacs via <package type="meta"> tags in their Kickstart files, which install software bundles.

  1. Install xemacs on the HN:
    cd /home/install/contrib/4.3/x86_64/RPMS
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/xemacs-common-21.4.15-10.EL.1.x86_64.rpm"
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/xemacs-21.4.15-10.EL.1.x86_64.rpm"
    yum localinstall xemacs-common-21.4.15-10.EL.1.x86_64.rpm
    yum localinstall xemacs-21.4.15-10.EL.1.x86_64.rpm
  2. Install xemacs and emacs on the WNs:
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/apel-xemacs-10.6-5.noarch.rpm"
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/FreeWnn-libs-1.10pl020-5.x86_64.rpm"
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/Canna-libs-3.7p3-7.EL4.x86_64.rpm"
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/xemacs-sumo-20040818-2.noarch.rpm"
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/emacs-common-21.3-19.EL.4.x86_64.rpm"
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/emacs-21.3-19.EL.4.x86_64.rpm"
    Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml by adding the following <package> lines:
    <package>Canna-libs</package>
    <package>FreeWnn-libs</package>
    <package>apel-xemacs</package>
    <package>xemacs-sumo</package>
    <package>xemacs-common</package>
    <package>xemacs</package>
    <package>emacs-common</package>
    <package>emacs</package>
    Create the new Rocks kickstart distribution:
    cd /home/install
    rocks-dist dist
    Re-shoot the WNs.

It is not entirely clear if all these rpm files really must be downloaded (they should come with the SL4.5 release), but the instructions above have been verified to work.

Pacman:

We install Pacman on the HN and GN. Pacman 3.28 or later is required for the BeStMan release which comes packaged with OSG 1.2. As root (su -) on each node:
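The commands are not reproduced here; the following is a minimal sketch of a typical Pacman tarball install (the download URL placeholder, version number, and install directory are assumptions; adjust to the release you actually fetch):

cd /localsoft
wget <URL of the pacman-3.28 tarball>
tar xzf pacman-3.28.tar.gz
cd pacman-3.28
. setup.sh       # or "source setup.csh" from a csh-type shell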

Kerberos:

These instructions enable getting kerberos tickets from FNAL and from CERN. User instructions for kerberos authentication are given here.

Configure Kerberos on the HN. As root (su -) on the HN:

  1. To enable FNAL tickets, save this file as /etc/krb5.conf.
  2. To enable CERN tickets, add to /etc/krb.conf:
    CERN.CH
    CERN.CH afsdb1.cern.ch
    CERN.CH afsdb3.cern.ch
    CERN.CH afsdb2.cern.ch
  3. And add to /etc/krb.realms:
    .cern.ch CERN.CH
  4. Configure ssh to use Kerberos tickets:
    Make the appropriate file writeable:
    chmod +w /etc/ssh/ssh_config
    Add the lines to /etc/ssh/ssh_config:
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes

    Remove writeability:
    chmod -w /etc/ssh/ssh_config
    Restart the ssh service:
    /etc/init.d/sshd restart
  5. Add to /etc/skel/.cshrc:
    # Kerberos
    alias kinit_fnal '/usr/kerberos/bin/kinit -A -f'
    alias kinit_cern '/usr/kerberos/bin/kinit -5'

  6. Add to /etc/skel/.bashrc and to ~root/.bashrc:
    # Kerberos
    alias kinit_fnal='/usr/kerberos/bin/kinit -A -f'
    alias kinit_cern='/usr/kerberos/bin/kinit -5'
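
After the aliases are in place, a quick test from a user account (yourname@CERN.CH is a placeholder for your own CERN principal, and lxplus.cern.ch is just a convenient Kerberos-enabled host):

kinit_cern yourname@CERN.CH
klist                          # should list a krbtgt/CERN.CH@CERN.CH ticket
ssh yourname@lxplus.cern.ch    # should log in without a password prompt if GSSAPI is working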

Configure Kerberos on the WNs. As root (su -) on the HN:

  1. Copy krb5.conf to where it can be served from the HN during WN install:
    cp /etc/krb5.conf /home/install/contrib/4.3/x86_64/RPMS/krb5.conf
  2. Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and add to the <post> section:
    wget -P /etc http://<var name="Kickstart_PublicHostname"/>/install/rocks-dist/lan/x86_64/RedHat/RPMS/krb5.conf
    <file name="/etc/ssh/ssh_config" mode="append">
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
    </file>
    <file name="/etc/krb.conf" mode="append">
    CERN.CH
    CERN.CH afsdb1.cern.ch
    CERN.CH afsdb3.cern.ch
    CERN.CH afsdb2.cern.ch
    </file>
    <file name="/etc/krb.realms" mode="append">
    .cern.ch CERN.CH
    </file>
  3. Create the new Rocks distribution:
    cd /home/install
    rocks-dist dist
  4. Reinstall the WNs

CVS:

CVS needs to be configured to automatically contact the CMSSW repository using Kerberos-enabled authentication. A Kerberos-enabled CVS client is already installed on the HN, but the WNs use a version of CVS distributed by Rocks, which needs to be updated. While this is a one-time install, we believe it must be done after at least one version of CMSSW has been installed on your system. At the very least, it must be done after the one-time CMSSW install commands. Of course, Kerberos authentication to CERN must also be configured. These instructions also assume that RPMforge is installed on the WNs. These instructions are based on this FAQ.

On the GN as cmssoft (su - cmssoft), install the CMSSW CVS configuration package:

source /sharesoft/cmssw/slc4_ia32_gcc345/external/apt/<version>/etc/profile.d/init.csh
apt-get update
apt-get install cms+cms-cvs-utils+1.0-cms

On the HN as root (su -):

  1. Download the Kerberos-enabled CVS client for the other nodes:
    cd /home/install/contrib/4.3/x86_64/RPMS
    wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/cvs-1.11.17-9.RHEL4.x86_64.rpm"
  2. Install the Kerberos-enabled CVS on the other nodes. Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and add to the <post> section:
    wget "http://<var name="Kickstart_PublicHostname"/>/install/rocks-dist/lan/x86_64/RedHat/RPMS/cvs-1.11.17-9.RHEL4.x86_64.rpm"
    yum -y localinstall cvs-1.11.17-9.RHEL4.x86_64.rpm
  3. Create the new distribution:
    cd /home/install
    rocks-dist dist
  4. Reinstall the non-HN nodes.
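
Once the nodes have reinstalled, you can confirm that the Kerberos-enabled CVS client is in place (compute-0-0 is just an example node name):

ssh compute-0-0 'rpm -q cvs'    # should report cvs-1.11.17-9.RHEL4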

User instructions for CVS checkout are given here.

Subversion:

We already had RPMforge installed (to resolve dependencies) at the time we installed subversion. A dependency resolver such as RPMforge may be required to install subversion.

yum install subversion

cron garbage collection:

These instructions provide the cron and Rocks kickstart cron commands to add garbage collection of /tmp for all nodes.

First, create a cron job on the HN and GN. As root (su -) on each, edit /var/spool/cron/root and add the lines:
6 * * * * find /tmp -mtime +1 -type f -exec rm -f {} \;
36 2 * * 6 find /tmp -depth -mtime +7 -type d -exec rmdir --ignore-fail-on-non-empty {} \;

This will remove day-old files in /tmp on the HN & GN every hour on the 6th minute and week-old empty directories in /tmp every Saturday at 2:36.

Now create the cron job on the WNs & INs:

  1. Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and place the following commands inside the <post></post> brackets:
    <!-- Create a cron job that garbage-collects /tmp -->
    <file name="/var/spool/cron/root" mode="append">
    6 * * * * find /tmp -mtime +1 -type f -exec rm -f {} \;
    36 2 * * 6 find /tmp -depth -mtime +7 -type d -exec rmdir --ignore-fail-on-non-empty {} \;
    </file>
  2. Create the new distribution:
    cd /home/install
    rocks-dist dist
  3. Re-install the WNs & INs

Condor

We install Condor using the Rocks roll, then modify it to add Condor_G as a part of the OSG installation. To be safe, you should configure condor after you've installed OSG. These instructions are based on the very complete guide provided by Condor.

First we handle two issues: (1) there is a domain mismatch between internal and external hostnames from Rocks, and (2) CMSSW jobs cannot be evicted and resumed without loss of compute cycles. On the HN as root (su -):

  1. Edit /opt/condor/etc/condor_config.local and add the lines:
     TRUST_UID_DOMAIN = True
     PREEMPTION_REQUIREMENTS = False
     NEGOTIATOR_CONSIDER_PREEMPTION = False
     CLAIM_WORKLIFE = 300
     WANT_SUSPEND = True
     SUSPEND = ( (CpuBusyTime > 2 * $(MINUTE)) \
               && $(ActivationTimer) > 300 )
     CONTINUE = $(CPUIdle) && ($(ActivityTimer) > 10)
     PREEMPT = False 
  2. Replace the original Rocks Condor roll xml file that creates the condor_config.local file on the other nodes:
    cp /home/install/rocks-dist/lan/x86_64/build/nodes/condor-client.xml /home/install/site-profiles/4.3/nodes/replace-condor-client.xml
  3. Edit /home/install/site-profiles/4.3/nodes/replace-condor-client.xml and add the following inside the cat of /opt/condor/etc/condor_config.local (between lines with CONFEOF):
    TRUST_UID_DOMAIN = True
    PREEMPTION_REQUIREMENTS = False
    NEGOTIATOR_CONSIDER_PREEMPTION = False
    CLAIM_WORKLIFE = 300
    WANT_SUSPEND = True
    SUSPEND = ( (CpuBusyTime &gt; 2 * $(MINUTE)) &amp;&amp; $(ActivationTimer) &gt; 300 )
    CONTINUE = $(CPUIdle) &amp;&amp; ($(ActivityTimer) &gt; 10)
    PREEMPT = False

Additionally, the interactive and grid nodes should not actually service condor jobs; they should only submit them. We fix this by copying the entire <file name="/etc/rc.d/rocksconfig.d/post-90-condor-client"> section of replace-condor-client.xml to the interactive.xml and grid.xml Kickstart files and replacing the CondorConf tag "-t se" with "-t s".

Now we need to restart services, create the new Rocks distribution, and reinstall all the non-HN nodes. As root (su -) on the HN:

  1. Restart the Condor service on the HN:
    /etc/init.d/rocks-condor restart
  2. If OSG is installed, the OSG condor-devel service (as well as RSV, which uses condor-devel) needs to be restarted:
    ssh grid-0-0
    cd /sharesoft/osg/ce
    vdt-control --off osg-rsv condor-devel
    vdt-control --on condor-devel osg-rsv
  3. Create the new Rocks distribution:
    cd /home/install
    rocks-dist dist
  4. Reinstall the other nodes.
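
Once Condor has restarted and the nodes have reinstalled, a quick way to confirm that the new policy settings are active; condor_config_val is a standard Condor tool and compute-0-0 is just an example node name:

/opt/condor/bin/condor_config_val TRUST_UID_DOMAIN PREEMPT WANT_SUSPEND
ssh compute-0-0 '/opt/condor/bin/condor_config_val SUSPEND CONTINUE'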

We create a simple condor monitoring script that will route output to the web server, to be viewed by users:

  1. Create the file /root/condor-status-script.sh with the contents:
    #!/bin/bash
    . /root/.bashrc
    OUTPUT=/var/www/html/condor_status.txt
    echo -e " \n\n" >$OUTPUT
    echo -e "As of `date` \n">>$OUTPUT
    /opt/condor/bin/condor_status -submitters >>$OUTPUT
    /opt/condor/bin/condor_userprio -all >>$OUTPUT
    /opt/condor/bin/condor_status -run >>$OUTPUT
  2. Run it every 10 minutes by editing /var/spool/cron/root and adding the line:
    1,11,21,31,41,51 * * * * /root/condor-status-script.sh
  3. Output will be here.

Condor keeps logs in /var/opt/condor/log; StartLog & StarterLog are particularly useful. Generally, the most information can be found on the node which serviced (not submitted) the job you are investigating.
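
For example, to find which node is servicing a running job and then inspect its StarterLog entries (the job ID 123.0 and node name compute-0-3 are hypothetical):

/opt/condor/bin/condor_q -run                    # shows the execute node of each running job
ssh compute-0-3 'grep 123.0 /var/opt/condor/log/StarterLog*'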

 


 

Backup critical files

The files below should be backed up to a secure non-cluster location. Marguerite Tonjes currently maintains the backup of these files. Users can use /data as a backup location for their own files, but it is not sufficient for these critical admin files. Note that many of these files are readable only by root.

We have a backup script, /root/backup-script.sh, which is run by cron on a weekly basis. It will copy all the needed files to /root/backup, which should then be manually copied from the cluster to a different machine on a regular basis.

Note: /sharesoft/osg/ce cannot realistically be used to recover from total HN failure because some OSG services are placed outside of /sharesoft/osg/ce. But it's usually safe to recover from a backup of this directory when attempting to perform OSG software upgrades. When performing the backup, be sure to preserve existing permissions (cp -pr /sharesoft/osg/ce <backup dir>).

 


 

Recover from failure

Note: This section is currently inaccurate and under modification due to our recent change in site configuration.

An HN failure which requires only a reboot is relatively easy to deal with and simply involves manually starting a few services. An HN failure which requires a reinstall is difficult because the WNs must be reinstalled as well. Instructions are also provided to power down the entire cluster and turn it back on. This Rocks guide can help to upgrade or reconfigure the HN with minimal impact - you may want to append the files listed here to the FILEs directive in version.mk (files in /home/install/site-profiles are saved automatically).

Power down and up procedures

Before powering down, make sure you have a recent copy of the critical files to back up. Our backup script places all the needed critical files in /root/backup on a weekly basis. To power down, login to the HN as root (su -):

  1. cd /sharesoft/osg/ce
  2. . setup.sh
  3. vdt-control --off
  4. condor_status will show whether any jobs are running; if they are, shut down Condor without killing jobs by following this condor recipe
  5. ssh-agent $SHELL
  6. ssh-add
  7. cluster-fork "poweroff"
  8. poweroff

If you are concerned about the possibility of power spikes, go to the RDC:

  1. Flip both power switches on the back of the big disk array.
  2. Flip the power switch on the KVM (in the back of the rack).
  3. Turn the UPS off by pressing the O (circle) button.
  4. Flip the power switch on the back of the UPS.
  5. Flip the power switches on both large PDUs, in the middle of the rack. Each large PDU has two switches.
  6. Remove the floor tile directly behind the cluster.
  7. If possible without undue strain to the connectors, unplug both power cables from their sockets.
  8. Replace the floor tile.

To power up, go to the RDC:

If applicable:

  1. Remove the floor tile directly behind the cluster.
  2. Plug in power cables in the floor.
  3. Replace the floor tile.
  4. Flip UPS, big PDU, and KVM power switches.
  5. Turn UPS on by pressing | / Test button on the front.
  6. Turn the big disk array on by flipping both switches in the back: flip one switch, wait for the disks and fans to spin up and then spin down, then flip the second switch.

Once the big disk array fans and disks have spun down from their initial spin up:

  1. Press power button on HN. Wait for it to boot completely.
  2. Power cycle the switch using its power cable (the switch has no power switch, hardy har har).
  3. Login on the HN as root, start the GUI environment (startx).
  4. Open an internet browser and enter the address 10.255.255.254. If you don't get a response, wait a few more minutes for the switch to complete its startup, diagnostics, and configuration.
  5. Log into the switch (user name and password can be obtained from Marguerite Tonjes).
  6. Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
  7. Press the power buttons on all eight WNs. Wait a few seconds between each one.
  8. Follow the procedure below to recover from HN reboot.

Note: While our cluster has the ability to be powered up entirely over a network connection, this has not yet been configured. At present, powering up requires a visit to the RDC.

Recover from HN reboot

BeStMan & OSG should be started automatically at boot time. As root (su -) on the HN:

  1. Check RSV probes. If any probes are failing, it may be due to cron maintenance jobs for OSG which haven't run yet. Issue the command:
    crontab -l | grep osg
    and scan for any jobs with names that are similar to the failing probe. Execute the command manually and wait for the next RSV probe to run.
  2. If you rebooted the phedex node (phedex-node-0-7), you must restart the PhEDEx services following these instructions.
  3. PhEDEx, if still running, will reconnect with the BeStMan service automatically. You can verify that the instances are still running by checking the files on phedex-node-0-7:
    /scratch/phedex/current/Debug_T3_US_UMD/logs/download-srm
    /scratch/phedex/current/Prod_T3_US_UMD/logs/download-srm

    If PhEDEx does not reconnect, follow these instructions to stop and start the PhEDEx services.
  4. Check Ganglia. All nodes should be reporting; it is highly unlikely that an HN reboot alone would cause WNs to stop reporting. However, if they are not reporting, try restarting the Ganglia services:
    /etc/init.d/gmond restart
    /etc/init.d/gmetad restart

    If a node is still not reporting, you can attempt to reboot the WN:
    ssh-agent $SHELL
    ssh-add
    ssh compute-x-y 'reboot'

    or, to reboot all WNs:
    cluster-fork "reboot"

Recover from GN reboot

Recover from HN reinstall

  1. Install the HN and WNs following the Rocks installation instructions. Using the Rocks boot disk instead of PXE boot for the WNs, and forcing the default partitioning scheme for the WNs, both improve the probability of success. Don't forget to power cycle the switch and configure it via a web browser.
  2. Copy the backed up critical files to /root/backup. Make sure the read/write permissions are set correctly for each file. As root (su -):
    cd /root/backup
  3. Create at least one new user.
  4. Configure security following the instructions in security.txt.
  5. Copy the info and certificate files to the correct directory:
    cp security.txt ../.
    cp network-ports.txt ../.
    cp configure-external-network.sh ../.
    cp hepcms-0cert.pem ../.
    cp hepcms-0key.pem ../.
    cp http-hepcms-0cert.pem ../.
    cp http-hepcms-0key.pem ../.
  6. Follow the instructions in this How-To guide to change the WN partitions (if necessary), mount the big disk, place the WNs on the external network and install xemacs and emacs on both the HN and WNs. Instructions which call for rocks-dist dist (and the accompanying shoot-node) can be stacked. Shoot the nodes (re-install the WNs) once after configuring Rocks for the disks, network and emacs. Then install all the software (the CRAB & PhEDEx nodes must be shot one more time). A few notes:
    1. Backed up copies of many of the modified files should already be made, so there should be very few manual file edits. Be sure to save the original files in case of failure.
    2. The boot order of the WNs may have changed, so the Rocks name assignment may correspond to a different physical node. The external IP addresses map to an exact patch panel port number, so move the network cables to the correct port on the patch panel. Use /root/network-ports.txt as your guide and be sure to modify it with the new switch port numbers (or move the switch port cables if you prefer -- the switch doesn't care). You may also want to modify the LEDs displaying the Rocks internal name, which can be done at boot time (strike F2 during boot to get to setup), under "Embedded Server Management."

Recover from GN reinstall

Although a Rocks appliance, the grid node is never intended to be reinstalled via Rocks kickstart. It is installed once from Rocks kickstart and all subsequent installs are done from its command line. If issuing a shoot-node on the grid node is absolutely necessary, the relevant software and hardware which must be reconfigured are:

The big disk array
CMSSW
OSG
PhEDEx



 

Solutions to encountered errors

Errors are organized by the program which caused them:

RAID

Rocks

Condor

LVM

CMSSW

gcc/g++/gcc4/g++4

Dell OpenManage

YUM

gLite

SRM

OSG/RSV

SiteDB/PhEDEx