Admin How-To Guide
This guide is deprecated: as of Dec. 31, 2009, a newer guide for SL5 is available. This page will be moved to our archives soon, so please update your links.
This guide is primarily meant for UMD admins and serves as a documented single use case of a USCMS Tier 3 site. It is based on our cluster configuration and hardware, documented here.
We do not call out where you might need to change your syntax, so if you are a non-UMD admin, we recommend you follow the guides linked at the beginning of each set of instructions and reference this guide to see what choices we made. We have not necessarily followed best-practice setups, and you use this guide to set up or modify your cluster at your own risk.
Dependencies are listed at the beginning of every set of instructions. Instructions are ideally (but not in practice) updated every time we install a new software release. If any links have expired, any errors are found, or some points are unclear, please notify the System administrators.
In mid-August 2009, we performed numerous software updates and changed our hardware configuration. We have kept an archived copy of the old admin guide as well as the old hardware configuration.
Last edited November 8, 2012
Table of Contents
- Connect to the switch
- Install Rocks 4.3 with SL4.5
- Modify Rocks
- Upgrade RAID firmware & drivers
- Configure the big disk array
- Instrument & monitor
- Install CMSSW
- Install CRAB
- Install OSG
- Install PhEDEx
- Install/configure other software
- Backup critical files
- Recover from HN failure
- Solutions to encountered errors
Connect to the switch
This is a guide intended for basic setup in a Rocks cluster. The Dell 6224 (a rebranded Cisco) is a fully managed switch, meant for use in a larger switching fabric, so it has many powerful features (most of which will not be covered here). The specific configuration details may vary, depending on your local environment. If in doubt, please consult your local network administrator. We first connect to the switch via a direct serial connection to get it to issue DHCP requests. We then get Rocks to listen to the DHCP request and assign an IP address, then do final configuration via a web browser.
In addition to the information we provide, all of the Dell 6224 manuals can be downloaded here.
Direct serial connection
1. The VT100 emulator:
First, connect the switch and headnode (or computer of choice) using the manufacturer-supplied serial cable. A terminal program such as 'minicom' (available in most Linux distros) can be used to talk to the switch. Note that we were unable to get our headnode to communicate with the switch over the serial console using minicom, so a laptop with a serial port running Linux was used instead (this is a local anomaly, and should not be considered the default).
Alternative terminal programs for serial console:
- Windows = HyperTerminal (included with most versions of Windows)
- Linux with a GUI = gtkterm (available in most distros, though not SL; if not, it is easily found)
2. Settings for serial console:
The most common configuration for asynchronous serial mode is used: 8-N-1.
8 = 8 data bits
N = no parity bit
1 = 1 stop bit
Most console programs will default to these settings. Additionally, the communication speed should be set to at least 9600 baud.
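For minicom, these settings can be applied from its setup menu. A minimal sketch (the serial device name is an assumption; a USB-serial adapter typically shows up as /dev/ttyUSB0 instead):
minicom -s
# under "Serial port setup": set the device to /dev/ttyS0, Bps/Par/Bits to 9600 8N1,
# and turn hardware and software flow control off; then "Save setup as dfl" and exit
# the menu to open the connection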
3. Initial setup:
Power on the switch and wait for startup to complete. The Easy Setup Wizard will display on an unconfigured switch. These are the important points:
- Would you like to set up the SNMP management interface now? [Y/N] N
Choose no (unless you have centralized Dell OpenManage or other management).
- To set up a user account: the default account is 'admin', but any name may be used.
- To set up an IP address: Choose 'DHCP', as Rocks will handle address assignments in the cluster.
- Select 'Y' to save the config and restart.
We also experimented with dividing certain types of traffic into separate VLANs. It was deemed unnecessary, given the present size of our cluster, but may be revisited should we add considerably more nodes, or if network traffic control proves problematic.
4. Network connections:
Now get Rocks to recognize the DHCP request issued by the switch by proceeding with step 9 of the Rocks installation instructions. In short, after Rocks has been installed on the HN:
insert-ethers
Select 'Ethernet switches'
Wait at least 30 mins after powering the switch for it to issue the DHCP request
After Rocks assigns an IP to the switch, it can be configured over telnet, SSH, and HTTP, from the headnode. The default name for the switch is network-0-0.
Using a graphical browser:
As outlined in step 9 of the Rocks installation instructions, the Spanning Tree Protocol (STP) must be disabled. It is often recommended to configure STP, which we did initially, but we could not get the worker nodes to pull an address from DHCP. After some experimentation, all ports on the switch were set to 'portfast' mode, which solved the problem; however, this is essentially the same as turning STP off completely, which also works just fine. The problem is that links will go up and down a few times during the DHCP request, and STP won't activate a port until it has been up for several seconds, so Rocks would never see the end nodes. Disabling STP can be done from the command line, but it is simpler to use the web-enabled interface from a browser on the headnode (or over X-forwarding from the command line).
From the head node, open a graphical browser and enter the IP address: 10.255.255.254. The user name and password can be given by the System administrators. This is a semi-dynamically allocated IP, so in rare cases, the IP may be re-assigned. If this IP does not connect you to the switch, issue the command 'dbreport dhcpd' and look for the network-0-0.local bracket, where the local IP address will be listed. If the network-0-0.local bracket does not exist, a portion of the Rocks install must be redone (see "Install Rocks" below, instruction 9). Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
If, for some reason, the browser method doesn't work, type these commands at the VT100 console provided by minicom or similar software:
console#config
console(config)#spanning-tree disable <port number>
(this will have to be done for all 24 ports!)
console(config)#exit
console#show spanning-tree
Spanning tree Disabled mode rstp
console#quit
Install Rocks
These instructions are for installing Rocks 4.3 with Scientific Linux 4.5 on the x86_64 architecture, adding the condor roll. Rocks downloads are available here; SL4.5 is available here. The Rocks 4.3 user's guide is available here.
- Download the Rocks Kernel/Boot roll
- Download the Rocks Core roll (includes the required rolls of base, hpc, and web-server, with a few nice extras)
- Download the Rocks Condor roll
- Download Scientific Linux 4.5 (all disks)
- Download a special SL4.5 patch for Rocks, labeled the comps roll.
- Burn all the .iso files to disks (on Windows, an .iso burning tool such as BurnCDCC works).
- Follow the Rocks 4.3 user's guide to install Rocks on the head node. Additions to the guide:
- Our network configuration is detailed here. The initial boot phase is on a timer and will terminate if you do not enter the network information quickly enough.
- We selected the base, ganglia, hpc, java, and web-server rolls from the Core CD. We believe the grid roll may actually be counter-productive, as it attempts to set up the cluster as a certificate authority, which may interfere with the OSG configuration.
- Be sure to add the kernel, comps and condor rolls.
- Insert each SL4.5 disk in turn and select the LTS roll listed.
- As far as we know, the certificate information requested on the "Cluster Information" screen is not used by any applications that we install. We entered the following, which may or may not be correct:
FQHN: HEPCMS-0.UMD.EDU (originally, now hepcms-hn.umd.edu)
Name: UMD HEP CMS T3
Certificate Organization: DOEgrids
Certificate Locality: College Park
Certificate State: Maryland
Certificate Country: US
Contact: mtonjes@nospam.umd.edu (w/o the nospam)
URL: http://hep-t3.physics.umd.edu
Latitude/Longitude: 38.98N -76.92W
- Select manual partitioning and allocate the following partition table; sizes are in MB (if you wish to preserve existing data, be sure to restore the partition table and don't modify any partitions you wish to keep):
/dev/sda :
/ 8189 /sda1 ext3
swap 8189 /sda2 swap
/var 4095 /sda3 ext3
/sda4 is the extended partition which includes /sda5
/scratch 48901 /sda5 ext3 (fill to max available size)
/dev/sdb 418168 (RAID-5, 408.38 GB; physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5):
/export 418168 /sdb1 ext3 (fill to max available size)
Leave /dev/sdc (the big disk array) alone, as it is a logical volume and Rocks cannot handle logical volumes at the install stage.
- In some cases, Rocks does not properly eject the boot disk before restarting. Be sure to eject the disk after Rocks is done installing, but before the reboot sequence completes and goes to the CD boot.
- On your first login to the HN, you will be prompted to generate rsa keys. You should do so (the default file is fine, as well as using the same password).
- Read the Rocks 4.3 user's guide on how to change the partition tables on the worker nodes. Note the code below may not work if you have existing partitions on any of the WNs. Rocks tries to preserve existing partitions when it can. If the code below does not work (symptoms include pop-ups during install complaining about FDiskWindow & Kernel panics from incorrectly synced configs after install is complete), try forcing the default partitioning scheme & modifying the Rocks WN partitions after install. In this case, you will probably lose any existing data on the WNs and should use the Rocks boot disk rather than PXE boot. Additionally, our setup somehow causes LABEL synchronization issues on subsequent calls to shoot-node; we must add some commands to extend-compute.xml to fix this issue. The necessary commands to set the WN partitions prior to the first WN Rocks installation:
- cd /home/install/site-profiles/4.3/nodes/
- cp skeleton.xml replace-auto-partition.xml
- Edit the <main> section of replace-auto-partition.xml:
<main>
<part> / --size 8192 --ondisk sda </part>
<part> swap --size 8192 --ondisk sda </part>
<part> /var --size 4096 --ondisk sda </part>
<part> /scratch --size 1 --grow --ondisk sda </part>
<part> /tmp --size 1 --grow --ondisk sdb </part>
</main>
- cp skeleton.xml extend-compute.xml
- Edit the <post> section of extend-compute.xml and add:
e2label /dev/sda1 /
cat /etc/fstab | sed -e s_LABEL=/1_LABEL=/_ > /tmp/fstab
cp -f /tmp/fstab /etc/fstab
cat /boot/grub/grub-orig.conf | sed -e s_LABEL=/1_LABEL=/_ > /tmp/grub.conf
cp -f /tmp/grub.conf /boot/grub/grub-orig.conf
chmod +w /boot/grub/grub-orig.conf
unlink /boot/grub/grub.conf
ln -s /boot/grub/grub-orig.conf /boot/grub/grub.conf
- cd /home/install
- rocks-dist dist
- Follow the Rocks 4.3 user's guide to set up the worker nodes.
Additions to the guide:
- If you have not already done so, be sure to configure the switch via the serial cable to get its IP via DHCP and set a login name and password (for internet management).
- We do have a managed switch, so the first task, done by selecting 'Ethernet switches' in the insert-ethers menu, should be performed. The switch takes a long time to issue DHCP requests after powering up; wait at least 30 mins.
- Quit insert-ethers using the F11 key, not F10.
- Once insert-ethers has detected the switch, open an internet browser and log into the switch (typically 10.255.255.254, but dbreport dhcpd lists the switch's local IP inside the network-X-Y bracket). The user name and password can be provided to you by the System administrators.
- Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
- Continue with the remainder of the Rocks WN instructions.
- PXE boot can be initiated on all the WNs by striking the F12 key at the time of boot. Alternatively, insert the Rocks Kernel/Boot CD into each WN shortly after pressing the power button.
- Security needs to be configured (quickly). Instructions to do so are located in the file ~root/security.txt, readable only by root. If this file was lost during the Rocks install, contact the System administrators for the backup. If you are another site following these instructions, you can contact the Sysadmins for a copy to use as a starting point for your local site security configuration (security tends to be site-specific and we don't claim ours is fool-proof). Your identity will need to be confirmed by the Sysadmins.
- Be sure to update all the rpm's on all nodes after they are installed.
Modify Rocks
- Modify cluster database
- Prevent automatic re-install
- Non-HN re-installation
- Modify non-HN partitions
- Configure external network for all other nodes
- Add new users
- Modify users
- Add rolls
- Create appliances
- Update RPMs
Modify cluster database
The information stored in the Rocks cluster database can be viewed and edited here, user name and password can be obtained from Marguerite Tonjes. The MySQL DB can be restarted by issuing the command /etc/init.d/mysqld restart from the HN as root (su -).
Prevent automatic re-install
Rocks will automatically re-install all nodes except the HN after they have experienced a hard reboot (such as power failure). This is a useful feature during installation stages, but can be a performance issue once the cluster is in a stable configuration. Simply follow the instructions in this Rocks FAQ to disable this feature. Be sure to re-install the nodes (not the HN) to get the changes to propagate. After removing this feature, shoot-node and cluster-kickstart commands will issue the error:
cannot remove lock: unlink failed: No such file or directory
error reading information on service rocks-grub: No such file or directory
cannot reboot: /sbin/chkconfig failed: Illegal seek
which can be safely ignored.
Non-HN re-installation
Some modifications will require the nodes in the Rocks network to be reinstalled. This tends to be true in cases which require you to issue the command 'rocks-dist dist,' typically because you edited an .xml file. In most cases, this involves simply issuing:
ssh-agent $SHELL
ssh-add
shoot-node compute-0-0
(repeat for all desired nodes)
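If all eight compute nodes need re-shooting, the calls can be wrapped in a short loop; a minimal sketch, assuming nodes named compute-0-0 through compute-0-7 as in our cluster:
ssh-agent $SHELL
ssh-add
for n in 0 1 2 3 4 5 6 7; do shoot-node compute-0-$n; done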
An alternative method of re-shooting the nodes is shown below. It is not clear which approach is superior.
ssh-agent $SHELL
ssh-add
ssh compute-0-0 'sh /home/install/sbin/nukeit.sh'
(repeat for all desired nodes or use cluster-fork)
ssh compute-0-0 '/boot/kickstart/cluster-kickstart'
(repeat for all desired nodes or use cluster-fork)
If you have not yet made nukeit.sh, see the instructions to modify WN partitions.
OMSA cannot be fully installed as a part of the Rocks Kickstart. Be sure to follow the instructions for OMSA non-HN installation in step 3 after every reinstall.
Since Rocks requires a reinstall of nodes every time a change is made to their kickstart files and we have interactive nodes, you may want to wait until a scheduled maintenance time to reinstall. The cluster-fork command is useful to get the desired functionality prior to reinstall:
ssh-agent $SHELL
ssh-add
cluster-fork "command"
"command" can be anything you'd like run on each WN individually, which could include a network-mounted shell script.
After every major reinstall, in addition to testing whatever changes were made, we like to test a few basic capabilities to make sure nothing was broken. A general outline of the tests we perform:
- We first check that the nodes are reporting to Ganglia. Failure to do so indicates a serious problem, which will probably only be resolved by going to the RDC to examine the hardware and perform another WN reinstall (after tracking the problem down and fixing it).
- We have a "Hello World" C++ program which we compile and run. Failure typically indicates some sort of endemic, low-level problem, which will probably only be solved by another WN reinstall (after you've tracked the problem down and fixed it). Note we do need additional C++ compilers as a part of the Rocks kickstart.
- We have a "vanilla" Condor .jdl file which simply executes sleep. We check both that it ran and that it was submitted to nodes other than the submitting node (submit more than 8 jobs - if they all run simultaneously, the jobs were successfully submitted to more than one node). Failure typically indicates a problem with the condor configuration, controlled via a Rocks kickstart file. It may also indicate an error with the network configuration.
- We have a very simple CMSSW config that generates a handful of events using only Configuration/StandardSequences (no custom C++ code). This CMSSW program uses Frontier conditions to test our Squid server. It also sends output to a variety of locations to test disk mounts. CMSSW and Squid are installed only on the HN, so WN reinstall should not damage the installations. Failure of the CMSSW program may be indicative of a problem with network disk mounts or PATHs. Failure of Squid during cmsRun (which typically prints errors, but does not quit) typically indicates a network problem.
- We have a very simple CMSSW program that analyzes DBS events hosted on the cluster (a basic EDAnalyzer). We do not run the CMSSW program locally. Instead we run the CMSSW job via CRAB, which will test a number of important services all at once. Failure could be due to any number of issues including, but not limited to, gLite, CRAB, or OSG interaction with the WNs. We set the following values in our crab.cfg file:
- pset, output_file : the CMSSW config and name(s) of output file(s)
- se_white_list = UMD.EDU, ce_white_list = UMD.EDU : this tests that we can run jobs in addition to submitting them
- datasetpath : any DBS dataset known to be hosted at the cluster; primarily tests that nodes can access files in the 'file catalog'
- scheduler : we use condor_g the first time for rapid-response debugging. Once the condor_g jobs have completed successfully, we sometimes submit a second CRAB job with glite as the scheduler, particularly if we've made any changes to the gLite-UI install.
- return_data = 0, copy_data = 1, storage_element = hepcms-0.umd.edu, storage_path = /srm/v2/server?SFN=/data/users/srm-drop : tests both our ability to stage-out files using the srm-client from Fermilab and to receive files using the BeStMan server.
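A minimal sketch of the sleep submit file mentioned above (the file name, sleep duration, and job count are arbitrary choices; 16 jobs comfortably exceeds the 8 slots of a single node):
# sleep.jdl - vanilla-universe Condor test job
universe = vanilla
executable = /bin/sleep
arguments = 300
log = sleep_$(Cluster).log
output = sleep_$(Cluster)_$(Process).out
error = sleep_$(Cluster)_$(Process).err
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 16
Submit it with condor_submit sleep.jdl and use condor_q -run to see which machines the running jobs landed on.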
Modify non-HN partitions
These instructions are based on this Rocks guide. You will lose any existing data on the node. Additionally, our setup somehow causes LABEL synchronization issues on subsequent calls to shoot-node; we must add some commands to extend-compute.xml to fix this issue.
As root (su -) on the HN:
cd /home/install/site-profiles/4.3/nodes/
cp skeleton.xml replace-auto-partition.xml
If extend-compute.xml does not yet exist:
cp skeleton.xml extend-compute.xml
Edit the <main> section of replace-auto-partition.xml:
<main>
<part> / --size 8192 --ondisk sda </part>
<part> swap --size 8192 --ondisk sda </part>
<part> /var --size 4096 --ondisk sda </part>
<part> /scratch --size 1 --grow --ondisk sda </part>
<part> /tmp --size 1 --grow --ondisk sdb </part>
</main>
Edit the <post> section of extend-compute.xml and add:
e2label /dev/sda1 /
cat /etc/fstab | sed -e s_LABEL=/1_LABEL=/_ > /tmp/fstab
cp -f /tmp/fstab /etc/fstab
cat /boot/grub/grub-orig.conf | sed -e s_LABEL=/1_LABEL=/_ > /tmp/grub.conf
cp -f /tmp/grub.conf /boot/grub/grub-orig.conf
chmod +w /boot/grub/grub-orig.conf
unlink /boot/grub/grub.conf
ln -s /boot/grub/grub-orig.conf /boot/grub/grub.conf
cd /home/install
rocks-dist dist
rocks remove host partition compute-0-0
(repeat through compute-0-7)
Create the /home/install/sbin/nukeit.sh script:
#!/bin/sh
# remove the .rocks-release marker from every mounted filesystem so that
# Rocks will repartition the node on the next reinstall
for i in `df | awk '{print $6}'`
do
  if [ -f $i/.rocks-release ]
  then
    rm -f $i/.rocks-release
  fi
done
ssh compute-0-0 'sh /home/install/sbin/nukeit.sh'
(repeat for all desired nodes or use cluster-fork)
ssh compute-0-0 '/boot/kickstart/cluster-kickstart'
(repeat for all desired nodes or use cluster-fork)
In some cases the partitions aren't done properly; it is unclear why. Kernel panics when the node attempts to boot are an indicator of this issue (the node will never reconnect; you must physically go to the node to ascertain this condition). In such a case, it is best to force the default partitioning scheme on these nodes, install, then try again with the preferred partitioning scheme. Use the Rocks boot disk, as PXE boot does not seem sufficient. To force the default partitioning scheme, simply replace all the <part> lines in replace-auto-partition.xml with:
<part> force-default </part>
You will lose all data on the nodes for which you force the default scheme.
Configure external network for all other nodes:
- Follow this Rocks guide for activating and configuring the second ethernet interface.
- Use our network configuration to determine the appropriate values to enter. Alternatively, call the script /root/configure-external-network.sh.
- These instructions state how to re-install the nodes.
Add new users
First set the default shell for all new users to tcsh. Edit /etc/default/useradd and change the SHELL line to:
SHELL=/bin/tcsh
This is optional, but commands in this guide assume that root uses a bash shell and all other users use a c-shell.
useradd -c "Full Name" -n username
passwd username (select an initial password)
chage -d 0 username
ssh-agent $SHELL
ssh-add (enter the root password)
rocks sync config
rocks sync users
If the big disk array has already been mounted, give the user their own directory:
mkdir /data/users/username
chown username:users /data/users/username
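The steps above can be collected into a small script. This is only a sketch under our conventions (the script name and argument handling are ours, not part of Rocks):
#!/bin/bash
# newuser.sh (hypothetical name) - add one user following the steps above.
# Usage: ./newuser.sh username "Full Name"     (run as root on the HN)
U=$1
FULLNAME=$2
useradd -c "$FULLNAME" -n "$U"
passwd "$U"                    # choose an initial password interactively
chage -d 0 "$U"                # force a password change at first login
eval `ssh-agent`               # script equivalent of ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users
kill $SSH_AGENT_PID            # stop the agent started above
# if the big disk array is already mounted, give the user a directory there
if [ -d /data/users ]; then
    mkdir /data/users/"$U"
    chown "$U":users /data/users/"$U"
fi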
Some notes:
- User instructions for first-time logging in are given here.
- All the files inside /etc/skel (such as .bashrc & .cshrc) are copied to each new user's /home area. If files in /etc/skel are modified after users have already been made, the existing users need to be informed of the changes. This guide puts environment variables and aliases in /etc/skel so that users can see where important programs are located. Alternatively, environment variables can be placed in /etc/profile (for bash) and /etc/csh.login (for c-shells) and aliases can be placed in /etc/bashrc (for bash) and /etc/csh.cshrc (for c-shells).
Modify users
As root (su -), first utilize standard Linux commands to modify the user (system-config-users provides a GUI if desired). Then update Rocks:
ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users
Note that to delete a user's home area, you must remove it from /export/home manually. You must also remove the relevant lines in /etc/auto.home:
chmod 744 /etc/auto.home
remove the line with the user's name
chmod 444 /etc/auto.home
make -C /var/411
You should also remove their space in /data. On the GN as root (su -):
rm -rf /data/users/username
Add rolls
Download the appropriate .iso file from Rocks. We'll call it rollFile.iso which corresponds to rollName.
As root (su -):
mount -o loop rollFile.iso /mnt/cdrom
cd /home/install
rocks-dist --install copyroll
umount /mnt/cdrom
rocks-dist dist
kroll rollName | bash
init 6
You can check that the roll installed successfully:
dbreport kickstart HEPCMS-0 > /tmp/ks.cfg
Look in /tmp/ks.cfg for something like:
# ./nodes/somefiles.xml (rollName)
While documentation on this is poor, it seems wisest to re-install the WNs to ensure the changes are propagated to them.
Create appliances
We create Rocks appliances for our interactive & grid nodes. Neither the interactive nor the grid appliance services condor jobs, though both can submit them, so a special section in their Kickstart xml files configures condor correctly for this case.
These commands are executed as root (su -) on the Rocks head node.
To create the grid appliance:
Our grid node Kickstart file is bare bones because OSG cannot be preserved via tarball for later reinstalls. The grid appliance is not intended for subsequent reinstall, so be sure to configure its partition table, external network interface, and Condor, then reinstall before installing any other software on the grid node. Since the grid node has a different partition table than the compute nodes, we modify its partition table inside grid.xml below.
- Place the files grid.xml in /home/install/site-profiles/4.3/nodes and grid-appliance.xml in /home/install/site-profiles/4.3/graphs/default.
- Create the new Rocks distribution:
cd /home/install
rocks-dist dist
- Add an entry for the new grid appliance to the Rocks MySQL database:
rocks add appliance grid membership='Grid Management Node' short-name='gr' node='grid'
- Verify that the new XML code is correct:
rocks list appliance xml grid
If this throws an exception, the last line states where the syntax problem is.
- Now install the grid node by calling insert-ethers, selecting Grid Management Node, powering up the new node and selecting PXE boot on the new node as it boots.
To create the interactive appliance:
Our interactive node kickstart file is similar to the grid node, except interactive nodes also install gLite-UI & CRAB via Kickstart. Therefore, interactive nodes can be successfully reinstalled via Rocks Kickstart without loss of software.
- Navigate to the gLite-UI tarball repository and select your desired version of gLite-UI. These instructions are for 3.1.28-0, though they can be adapted for other releases. Download the lcg-CA yum repo and tarballs where they can be served from the HN:
cd /home/install/contrib/4.3/x86_64/RPMS
wget "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo"
wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0.tar.gz"
wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0-external.tar.gz" - Navigate to the CRAB download page and select your desired version of CRAB. These instructions are for 2_6_1, though they can be adapted for other releases. Download the tarball where it can be served from the HN:
cd /home/install/contrib/4.3/x86_64/RPMS
wget --no-check-certificate "http://cmsdoc.cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_6_1.tgz"
- Place the files interactive.xml in /home/install/site-profiles/4.3/nodes and interactive-appliance.xml in /home/install/site-profiles/4.3/graphs/default.
- Create the new Rocks distribution:
cd /home/install
rocks-dist dist
- Add an entry for the new interactive appliance to the Rocks MySQL database:
rocks add appliance interactive membership='Interactive Node' short-name='in' node='interactive'
- Verify that the new XML code is correct:
rocks list appliance xml interactive
If this throws an exception, the last line states where the syntax problem is.
- Now install the interactive node by calling insert-ethers, selecting Interactive Node, powering up the new node and selecting PXE boot on the new node as it boots.
Update RPMs
We choose to call yum -y update on all nodes to update RPMs instead of serving the updated RPMs from the HN during Kickstart. This increases the time to a fully operational node after reinstall, but saves the human time needed to track down the many RPMs. Be sure to call yum update on a regular basis on all nodes - you may want to create a cron job to do it (a sketch follows), though you will still need to check whether a reboot is needed.
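A sketch of such a cron entry (the schedule and log file are arbitrary choices; it deliberately does not reboot anything):
# /var/spool/cron/root: run a full update early every Sunday and keep a log;
# kernel, selinux-policy, and glibc updates still require a manual reboot afterwards
30 03 * * 0 /usr/bin/yum -y update >> /var/log/yum-auto-update.log 2>&1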
- Find out if this update will require a reboot:
yum check-update | grep -i kernel
yum check-update | grep -i selinux-policy
yum check-update | grep -i glibc
A new kernel always requires a reboot and the other two are safest with a reboot. If the grid node will have a kernel update, the xfs rpm appropriate to the new kernel needs to be installed:
rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/kernel-module-xfs-#.#.#-##.ELsmp-0.4-1.x86_64.rpm" - The tomcat-connectors rpm conflicts with files in the httpd rpm. Check if an update to httpd will occur:
yum check-update | grep -i httpd
If so, remove tomcat-connectors (we'll reinstall it when we're done):
yum -y remove tomcat-connectors
- Now update (this process could take quite some time):
yum -y update
- If tomcat-connectors was removed, install it again:
rpm -ivh /home/install/rocks-dist/lan/x86_64/RedHat/RPMS/tomcat-connectors-1.2.20-0.x86_64.rpm --force
- If appropriate, reboot the node, check that the new Kernel is running, and start OMSA, if it's installed:
reboot
uname -r
srvadmin-services.sh start
Upgrade RAID firmware & drivers
Updating firmware will require shutdown of various services as well as reboot of the HN. Be sure to schedule all firmware and driver updates in advance. The instructions below provide details for handling the big disk array (/data), but do not require that it be configured properly before upgrade; indeed, it is recommended that the RAID firmware and drivers be upgraded before mounting the big disk.
- Go to www.dell.com
- Under support, enter the HN Dell service tag
- Select drivers & downloads
- Choose RHEL4.5 for the OS
- Select SAS RAID Controller for the category
- Select the drivers and firmware for PERC 6/E Adaptor and PERC 6/i Integrated for download.
- Follow the PhEDEx instructions to stop all PhEDEx services.
- Stop OSG services:
/etc/rc3.d/S97bestman stop
cd /sharesoft/osg/ce
. setup.sh
vdt-control --off
- Stop what file services we can:
omconfig system webserver action=stop
/etc/init.d/dataeng stop
cd /sharesoft/osg/ce
. setup.sh
vdt-control --off
/etc/rc.d/init.d/nfs stop
umount /data
cluster-fork "umount /data" - As root (su -) on the HN, install the firmware:
- The firmware download link should go to an executable, which is not the right file to install in Linux. From the executable name and location and by browsing the ftp server, you can extrapolate the location of the READMEs, e.g.:
wget "ftp://ftp.us.dell.com/SAS-RAID/R216021.txt"
wget "ftp://ftp.us.dell.com/SAS-RAID/R216024.txt" - By reading the READMEs, you can extrapolate the location of the correct binaries, e.g.:
wget "ftp://ftp.us.dell.com/SAS-RAID/RAID_FRMW_LX_R216021.BIN"
wget "ftp://ftp.us.dell.com/SAS-RAID/RAID_FRMW_LX_R216024.BIN" - Make the binaries executable:
chmod +x RAID_FRMW_LX_R216021.BIN
chmod +x RAID_FRMW_LX_R216024.BIN
- Follow the instructions in the READMEs.
- Reboot after each firmware upgrade is complete, stopping all relevant services each time the HN comes back up.
- As root (su -) on the HN, install the driver:
- The driver download link should go to the README. From the README name and location and by browsing the ftp server, you can extrapolate the location of the tarball, e.g.:
wget "ftp://ftp.us.dell.com/SAS-RAID/megaraid_sas-v00.00.03.21-4-R193772.tar.gz" - Unpack the tarball:
tar -zxvf megaraid_sas-v00.00.03.21-4-R193772.tar.gz
- Print the current status:
modinfo megaraid_sas
- Install the appropriate rpms:
rpm -ivh dkms-2.0.19-1.noarch.rpm
rpm -ivh megaraid_sas-v00.00.03.21-4.noarch.rpm
- Print the new status (output should have changed):
modinfo megaraid_sas
dkms status
- Reboot the HN:
reboot
- Reboot all the WNs, as they may have difficulties accessing the network mounted files on the HN:
ssh-agent $SHELL
ssh-add
cluster-fork "reboot" - Be sure to restart the PhEDEx services after WN reboot.
Configure the big disk array
It is recommended but not required that the RAID firmware and drivers be updated prior to configuring the disk array. We chose to use LVM2 on a single partition for the large data array. This allows for future expansion and simple repartitioning as the need arises. While it is possible to use 'fdisk' to partition the array, it is not advisable, as 'fdisk' does not play nicely with LVM and our total volume size exceeds its 2 TB limit. It is also possible to create several smaller partitions and group them together with the 'vgcreate' command, but we considered that solution overly complicated. We also used the XFS disk format, as it is optimized for large disks and works well with BeStMan.
Create, format & mount the disk array on the GN:
As root (su -) on the grid node:
- Install XFS:
rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/kernel-module-xfs-2.6.9-55.ELsmp-0.4-1.x86_64.rpm"
rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/xfsprogs-2.6.13-1.SL.x86_64.rpm" - Identify the array's hardware designation with fdisk:
fdisk -l
Our disk array is currently /dev/sdc.
- Use GNU Parted to create the partition:
parted /dev/sdc
At the parted command prompt:
mklabel gpt
This changes the partition label to type GUID Partition Table.
mkpart primary 0 9293440M
This creates a primary partition which starts at 0 and ends at 9293440 MB.
print
This confirms the creation of our new partition; output should look similar to:
Disk geometry for /dev/sdc: 0.000-9293440.000 megabytes
Disk label type: gpt
Minor Start End Filesystem Name Flags
1 0.017 9293439.983
quit
- Assign the physical volumes (PV) for a new LVM volume group (VG):
pvcreate /dev/sdc1
- Create a new VG container for the PV. Our VG is named 'data' and contains one PV:
vgcreate data /dev/sdc1
- Create the logical volume (LV) with a desired size. The command takes the form:
lvcreate -L (size in KB,MB,GB,TB,etc) (VG name)
So, in our case:
lvcreate -L 9293440MB data
On this command, we receive the error message: Insufficient free extents (2323359) in volume group data: 2323360 required. Sometimes, it is simpler to enter the value in extents (the smallest logical units LVM uses to manage volume space). We will use a '-l' instead of '-L':
lvcreate -l 2323359 data
- Confirm the LV details:
vgdisplay
The output should look like:
--- Volume group ---
VG Name               data
System ID
Format                lvm2
Metadata Areas        1
Metadata Sequence No  2
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                1
Open LV               0
Max PV                0
Cur PV                1
Act PV                1
VG Size               8.86 TB
PE Size               4.00 MB
Total PE              2323359
Alloc PE / Size       2323359 / 8.86 TB
Free PE / Size        0 / 0
VG UUID               tcg3eq-cG1z-czIn-7j5a-YVM1-MT70-sqKAUY
- After these commands, the location of the volume is /dev/mapper/data-lvol0 (ascertain by examining the contents of /dev/mapper). Create a filesystem:
mkfs.xfs /dev/mapper/data-lvol0
- Create a mount point, edit /etc/fstab, and mount the volume:
mkdir /data
Add the following line to /etc/fstab:
/dev/mapper/data-lvol0 /data xfs defaults 1 2
And mount:
mount /data
- Confirm the volume and size:
df -h
Output should look like:
/dev/mapper/data-lvol0 8.9T 528K 8.9T 1% /data
- Create subdirectories and set permissions:
mkdir /data/se
mkdir /data/se/store
cd /
ln -s /data/se/store
mkdir /data/users
For all currently existing users:
mkdir /data/users/username
chown username:users /data/users/username
- We create an srm dropbox for our users to transfer files via the srm protocol:
mkdir /data/users/srm-drop
chown root:users /data/users/srm-drop
chmod 775 /data/users/srm-drop
We use a cron job to garbage collect this directory. Edit /var/spool/cron/root and add the line:
49 02 * * * find /data/users/srm-drop -mtime +7 -type f -exec rm -f {} \;
This will remove week-old files from /data/users/srm-drop every day at 2:49am.
Network mount the disk array on all the nodes
These commands network-mount /data on all nodes. First have the GN export /data. As root (su -) on the GN:
- Edit /etc/exports on the GN as root (su -):
chmod +w /etc/exports
Add this line to /etc/exports: /data 10.0.0.0/255.0.0.0(rw,async)
chmod -w /etc/exports
- Restart the GN NFS service:
/etc/init.d/nfs restart
- Have the NFS service start on the GN whenever it's rebooted:
/sbin/chkconfig --add nfs
chkconfig nfs on
Now have the HN mount /data and edit the Kickstart file to mount /data on all other nodes. As root (su -) on the HN:
- Edit /etc/fstab on the HN and tell it to get /data from the grid node:
grid-0-0:/data /data nfs rw 0 0
- Have the HN mount /data and make the symlink:
mkdir /data
mount /data
cd /
ln -s /data/se/store
- Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and place the following commands inside the <post></post> brackets:
<file name="/etc/fstab" mode="append">
grid-0-0:/data /data nfs rw 0 0
</file>
mkdir /data
mount /data
cd /
ln -s /data/se/store
cd -
Note that given the Rocks node inheritance structure, the grid node will also have its /etc/fstab file appended with this network mount if it's ever reinstalled. However, since reinstalling the grid node via Rocks Kickstart is highly undesirable anyway, we break the model here. If grid node reinstall is absolutely required, after reinstall, this line needs to be removed from the /etc/fstab file on the grid node and the logical volume line in the previous section needs to be used instead.
- Create the new distribution:
cd /home/install
rocks-dist dist
- Re-shoot the nodes following these instructions.
Instrument & monitor
These steps must be done after inserting the nodes into the Rocks database via insert-ethers. The BMCs on the WNs will issue DHCP requests which will confuse insert-ethers if you try to add both the node and its BMC at the same time. We configure the Baseboard Management Controllers (BMCs) on the WNs to respond to manual ipmish calls from the HN. However, for automation, we opted to have every node self-monitor, so every node also installs Dell's Open Manage Server Administrator (OMSA). We configured the BMCs on the WNs and IPMI on the HN prior to installing OMSA, but BMC documentation suggests it should be possible to configure the BMCs via OMSA. So the steps we took to configure the BMCs, including machine reboot and changing the settings manually on every node, may not be required. We installed OMSA 5.5 on all nodes and OpenIPMI on the HN from the disk which came with our system, packaged with OpenManage 5.3. Dell has not released OpenIPMI specifically for OpenManage 5.5, but we have not experienced any version mismatches by using the older OpenIPMI client.
- Install OpenIPMI on the HN
- Configure the WN & IN BMCs
- Install & configure OMSA on the HN & GN
- Install & configure OMSA on the WNs & INs
Install OpenIPMI on the HN:
At the RDC, start the OS GUI on the HN as root (startx). Insert the Dell OpenManage DVD into the HN drive (labelled Systems Management Tools and Documentation). Install Dell's management station software:
- Navigate to /media/cdrecorder/SYSMGMT/ManagementStation/linux/bmc
- Install:
rpm -Uvh osabmcutil9g-RHEL-3.0-11.i386.rpm
- Navigate to /media/cdrecorder/SYSMGMT/ManagementStation/linux/bmc/ipmitool/RHEL4_x86_64
- Install:
rpm -Uvh *rpm
- Start OpenIPMI:
/etc/init.d/ipmi start
Configure the WN & IN BMCs:
To configure the BMCs to respond to ipmish command-line calls from the HN, reboot each WN and configure its BIOS and remote access setup.
At boot time, press F2 to enter the BIOS configuration. Set the following:
- Serial communication: On with console redirection via COM2
- External Serial Communication: leave as COM1
- Failsafe Baud Rate: 57600
- Remote Terminal Type: leave as VT100/VT200
- Redirection after Boot: leave enabled
Enter the remote access setup shortly after BIOS boot by typing Ctrl-E. Set the following:
- IPMI Over Lan: On
- NIC Selection: Failover
- LAN Parameters:
- RMCP + Encryption Key: leave
- IP Address Source: DHCP
- DHCP host name: hepcms-hn
- VLAN Enable: leave off
- LAN Alert Enabled: on
- Alert Policy Entry 1: 10.1.1.1
- Host Name String: compute-x-y bmc
- LAN User Configuration: see /root/bmc.txt on the HN (hidden for security)
Before exiting the remote access setup, or as soon as possible afterwards, tell the HN to listen for DHCP requests coming from the BMC. As root (su -) on the HN:
- insert-ethers
- Select Remote Management
- After Rocks recognizes the BMC, exit with the F11 key.
You may need to reboot the WN to get all the new settings to work. To test that it's worked, execute from the HN:
ipmish -ip manager-x-y -u ... -p ... sysinfo
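To poll every BMC in one pass, the same call can be wrapped in a loop on the HN. A sketch, assuming the BMCs were registered as manager-0-0 through manager-0-7 by insert-ethers (substitute the user and password from the LAN User Configuration step; they are deliberately not shown here):
for n in 0 1 2 3 4 5 6 7; do
    ipmish -ip manager-0-$n -u ... -p ... sysinfo
done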
Install & configure OMSA on the HN & GN
Install Dell OpenManage Server Administrator (repeat for the HN & GN):
- Set up the environment:
mkdir /share/apps/OpenManage-5.5
cd /share/apps/OpenManage-5.5
- Download OMSA:
wget "http://ftp.us.dell.com/sysman/OM_5.5.0_ManNode_A00.tar.gz"
tar -xzvf OM_5.5.0_ManNode_A00.tar.gz
- Fool OpenManage into thinking we have a valid OS (which we do):
echo Nahant >> /etc/redhat-release
- Install OMSA:
cd linux/supportscripts
./srvadmin-install.sh
Choose "Install all" - Start OMSA:
srvadmin-services.sh start
- Check it's running and reporting:
omreport system summary
Navigate to https://hepcms-hn.umd.edu:1311
- The files created from unpacking the tarball can be deleted if desired; they were for installation purposes only.
Create the executables which will be called in the event of OMSA-detected warnings and failures. We issue notifications via email, including cell phone email-to-SMS addresses (which can be looked up on your cell phone provider's website):
- Create /share/apps/OpenManage-5.5/warningMail.sh:
#!/bin/sh
echo "Dell OpenManage has issued a warning on" `hostname` > /tmp/OMwarning.txt
echo "If HN: https://hepcms-hn.umd.edu:1311" >> /tmp/OMwarning.txt
echo "If WN: use ipmish from HN or omreport from WN" >> /tmp/OMwarning.txt
mail -s "hepcms warning" email1@domain1.com email2@domain2.net < /tmp/OMwarning.txt > /share/apps/OpenManage-5.5/warningMailFailed.txt 2>&1
- Create /share/apps/OpenManage-5.5/failureMail.sh:
#!/bin/sh
echo "Dell OpenManage has issued a failure alert on" `hostname` > /tmp/OMfailure.txt
echo "Immediate action may be required." >> /tmp/OMfailure.txt
echo "If HN: https://hepcms-hn.umd.edu:1311" >> /tmp/OMfailure.txt
echo "If WN: use ipmish from HN or omreport from WN" >> /tmp/OMfailure.txt
mail -s "hepcms failure" email1@domain1.com email2@domain2.net < /tmp/OMfailure.txt > /share/apps/OpenManage-5.5/failureMailFailed.txt 2>&1
- Make them executable and create the error log files:
chmod +x /share/apps/OpenManage-5.5/warningMail.sh
chmod +x /share/apps/OpenManage-5.5/failureMail.sh
touch /share/apps/OpenManage-5.5/warningMailFailed.txt
touch /share/apps/OpenManage-5.5/failureMailFailed.txt
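Before wiring these scripts into OMSA alerts, it is worth running them once by hand to confirm that mail delivery works; any mail errors land in the *MailFailed.txt files created above:
sh /share/apps/OpenManage-5.5/warningMail.sh
sh /share/apps/OpenManage-5.5/failureMail.sh
cat /share/apps/OpenManage-5.5/warningMailFailed.txt (should be empty)
cat /share/apps/OpenManage-5.5/failureMailFailed.txt (should be empty)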
Configure OMSA to handle warnings and failures:
- Navigate to https://hepcms-hn.umd.edu:1311 and log in
- To configure the HN to automatically shutdown in the event of temperature warnings:
- Select the Shutdown tab and the "Thermal Shutdown" subtab
- Select the Warning option and click the "Apply Changes" button
- Under the "Alert Management" tab, set the desired warning alerts to execute application /share/apps/OpenManage-5.5/warningMail.sh.
- Under the "Alert Management" tab, we set the following failure alerts to execute application /share/apps/OpenManage-5.5/failureMail.sh.
- Repeat for the GN (https://hepcms-0.umd.edu:1311).
Install & configure OMSA on the WNs & INs:
We install and configure OMSA via Rocks Kickstart. As root (su -) on the HN:
- Place the appropriate installation files to be served from the HN:
cd /home/install/contrib/4.3/x86_64/RPMS
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm"
wget "http://ftp.us.dell.com/sysman/OM_5.5.0_ManNode_A00.tar.gz" - Add the text in this xml fragment to the <post></post> section of /home/install/site-profiles/4.3/nodes/extend-compute.xml. If you are performing the OMSA install manually from the command line, you can reference the text in the xml fragment to see the commands executed to perform the install. The xml fragment is effectively a shell script, with & characters replaced by & and > by > .
- Create the new Kickstart:
cd /home/install
rocks-dist dist
- Reinstall all the WNs & INs.
- The OMSA install cannot be completed entirely in the Rocks Kickstart.
- Create a shell script which will complete the installation, /home/install/sbin/OMSAinstall.sh:
cd /scratch/OpenManage-5.5/linux/supportscripts
./srvadmin-install.sh -b
srvadmin-services.sh start &
- And a shell script which will configure OMSA, /home/install/sbin/OMSAconfigure.sh.
- Make them executable:
chmod +x /home/install/sbin/OMSAinstall.sh
chmod +x /home/install/sbin/OMSAconfigure.sh
- And execute them after every WN reinstall:
ssh-agent $SHELL
ssh-add
cluster-fork "/home/install/sbin/OMSAinstall.sh
cluster-fork "/home/install/sbin/OMSAconfigure.sh
Install CMSSW
Production releases of CMSSW can be installed automatically via OSG tools (email Bockjoo Kim to do so). Automatic installs require you to prepare your environment, install Squid, and map Bockjoo's grid certificate to the cmssoft account (see the OSG installation guide for details on how to do this with a grid-mapfile). The remaining instructions are for manual installations and are taken from this guide.
Prepare the environment:
- Create a user specifically for CMSSW installs, whom we will call cmssoft, following the instructions for adding new users.
- As root (su -) on the grid node, create /scratch/cmssw and cede control to cmssoft:
mkdir /scratch/cmssw
chown -R cmssoft:users /scratch/cmssw
Prepare it to be network mounted by editing /etc/exports and adding the line:
/scratch 10.0.0.0/255.0.0.0(rw,async)
- As root (su -) on the head node, network mount /scratch on the grid node as /sharesoft on all nodes:
- Create /etc/auto.sharesoft file with the content:
cmssw grid-0-0.local:/scratch/cmssw
And change the permissions:
chmod 444 /etc/auto.sharesoft
- Edit /etc/auto.master:
chmod 744 /etc/auto.master
Add the line: /sharesoft /etc/auto.sharesoft --timeout=1200
chmod 444 /etc/auto.master
- Inform 411, the Rocks information service, of the change:
cd /var/411
make clean
make
- Once /etc/auto.sharesoft has propagated to all the nodes from 411, restart the NFS services on the grid node. As root (su -) on the grid node:
/etc/rc.d/init.d/nfs restart
/etc/rc.d/init.d/portmap restart
service autofs reload
If the NFS service on the GN doesn't already start on reboot, configure that now:
/sbin/chkconfig --add nfs
chkconfig nfs on
- Tell WNs to restart their own auto-NFS service. As root (su -) on the head node:
ssh-agent $SHELL
ssh-add
cluster-fork '/etc/rc.d/init.d/autofs restart'
Note: Some directory restarts may fail because they are in use. However, /sharesoft should get mounted regardless.
- As cmssoft on the grid node (su - cmssoft), prepare for CMSSW installation following these instructions. Some notes:
- Set the correct permissions first:
chmod 755 /scratch/cmssw
- We use the VO_CMS_SW_DIR environment variable, as we later set up a link from the appropriate directory in the OSG app area to this directory:
setenv VO_CMS_SW_DIR /sharesoft/cmssw
It's important that this environment variable points to the network mount.
- We use the same SCRAM_ARCH as in the Twiki, e.g.:
setenv SCRAM_ARCH slc4_ia32_gcc345
- You can tail -f the log file to watch the install and check if the bootstrap was successful or to see any errors.
- We want all users to source the CMSSW environment on login according to these instructions. By placing the source commands in the .cshrc & .bashrc skeleton files, all new users will have the source inside their .cshrc & .bashrc files. Existing users will have to add this manually. As root (su -) on the HN, edit /etc/skel/.cshrc to include the lines:
# CMSSW
setenv VO_CMS_SW_DIR /sharesoft/cmssw
source $VO_CMS_SW_DIR/cmsset_default.csh
Similarly, edit /etc/skel/.bashrc:
# CMSSW
export VO_CMS_SW_DIR=/sharesoft/cmssw
. $VO_CMS_SW_DIR/cmsset_default.sh
- If OSG has been installed (instructions below are repeated under OSG installation):
- Inform BDII that we have the slc4_ia32_gcc345 environment. Edit /sharesoft/osg/app/etc/grid3-locations.txt to include the lines:
VO-cms-slc4_ia32_gcc345 slc4_ia32_gcc345 /sharesoft/cmssw
VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
(modify X_Y_Z and add a new line for each release of CMSSW installed)
- Create a link to CMSSW in the OSG app directory (set during OSG CE configuration inside config.ini):
cd /sharesoft/osg/app
mkdir cmssoft
ln -s /sharesoft/cmssw cmssoft/cms
Install Squid
The conditions database is managed by Frontier, which requires a Squid web proxy to be installed; we choose to install it on the HN. These instructions are based on these two (1, 2) Squid for CMS guides; be sure to check them for the most recent details.
As root (su -) on the HN:
- First create the Frontier user and give it ownership of the Squid installation and cache directory. As root (su -) on the HN:
useradd -c "Frontier Squid" -n dbfrontier -s /bin/bash
passwd dbfrontier
ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users
mkdir /scratch/squid
chown dbfrontier:users /scratch/squid
- Login as the Frontier user (su - dbfrontier).
- Download and unpack Squid for Frontier (check this link for the latest version):
wget "http://frontier.cern.ch/dist/frontier_squid-4.0rc9.tar.gz"
tar -xvzf frontier_squid-4.0rc9.tar.gz
cd frontier_squid-4.0rc9
- Configure Squid by calling the configuration script:
./configure
providing the following answers:
- Installation directory: /scratch/squid
- Network & netmask: 128.8.164.0/255.255.255.192 10.0.0.0/255.0.0.0
- Cache RAM (MB): 256
- Cache disk (MB): 5000
- Install:
make
make install
- Start the Squid server:
/scratch/squid/frontier-cache/utils/bin/fn-local-squid.sh start
- To have the Squid server start at boot time, as root (su -):
cp /scratch/squid/frontier-cache/utils/init.d/frontier-squid.sh /etc/init.d/.
/sbin/chkconfig --add frontier-squid.sh
- Create a cron job to rotate the logs:
crontab /scratch/squid/frontier-cache/utils/cron/crontab.dat
- We choose to restrict Squid access to CMS Frontier queries, since the IPs allowed by Squid include addresses not in our cluster. Edit /scratch/squid/frontier-cache/squid/etc/squid.conf and add the line:
http_access deny !CMSFRONTIER
which should be placed immediately before the line:
http_access allow NET_LOCAL
Then tell Squid to use the new configuration:
/scratch/squid/frontier-cache/squid/sbin/squid -k reconfigure
- Test Squid with Frontier
- Register your server
To change configuration options later, call make clean before make to get a fresh install. Be sure to stop the Squid server first (/scratch/squid/frontier-cache/utils/bin/fn-local-squid.sh stop).
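Independent of the Frontier test above, a quick sanity check that the Squid daemon is up is to look for its processes and listening port; this sketch assumes the default Squid port of 3128:
ps -ef | grep "[s]quid" (the parent and child squid processes should be listed)
netstat -lnp | grep 3128 (squid should be listening here)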
We create the site-local-config.xml and storage.xml files as a part of the PhEDEx installation, but they can be created right away. site-local-config.xml should be stored in /sharesoft/cmssw/SITECONF/T3_US_UMD/JobConfig and /sharesoft/cmssw/SITECONF/local/JobConfig while storage.xml should be in /sharesoft/cmssw/SITECONF/T3_US_UMD/PhEDEx and /sharesoft/cmssw/SITECONF/local/PhEDEx. Links provided as a part of the PhEDEx instructions:
- All CMS sites SITECONF directory
- The T3_US_UMD site-local-config.xml
- Twiki about site-local-config.xml
Install a CMSSW release:
- Login as cmssoft to the GN.
- The available CMSSW releases can be listed by:
apt-cache search cmssw | grep CMSSW
- Follow these instructions; some notes:
- Be sure to set VO_CMS_SW_DIR & SCRAM_ARCH, get the environment, and update:
setenv VO_CMS_SW_DIR /sharesoft/cmssw
setenv SCRAM_ARCH slc4_ia32_gcc345
source $VO_CMS_SW_DIR/$SCRAM_ARCH/external/apt/<apt-version>/etc/profile.d/init.csh
apt-get update
- RPM style options can be specified with syntax such as:
apt-get -o RPM::Install-Options::="--ignoresize" install cms+cmssw+CMSSW_X_Y_Z
- This process takes about an hour, depending on the quantity of data you'll need to download.
- You can safely ignore the message "find: python: No such file or directory"
- If OSG has been installed:
- Inform BDII that this release of CMSSW is available. As root (su -), edit /sharesoft/osg/app/etc/grid3-locations.txt to include the line:
VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
- Edit the grid policy and home page and add the version installed.
Uninstall a CMSSW release
- Login as cmssoft to the HN.
- List the currently installed CMSSW versions:
scramv1 list | grep CMSSW
- If OSG has been installed:
- Add a link to the CMSSW installation in the osg-app directory:
cd /sharesoft/osg/app
mkdir cmssoft
ln -s /sharesoft/cmssw cmssoft/cms
- Inform BDII that this release of CMSSW is no longer available. As root (su -), edit /sharesoft/osg/app/etc/grid3-locations.txt and remove the line:
VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
- Edit the grid policy page to remove the version and the home page to announce its removal.
- Remove a CMSSW release:
apt-get remove cms+cmssw+CMSSW_X_Y_Z
Install CRAB
We install CRAB with gLite-UI on the interactive nodes only. We've had problems trying to install gLite-UI via yum on the Rocks HN and are told we shouldn't install it on the OSG CE or SE. Some people have reported no issues with the gLite-UI tarball when they don't install it as root. gLite-UI is necessary to use the glite scheduler in users' crab.cfg, which allows users to submit directly to EGEE (European) sites. Alternatively, gLite-UI does not have to be installed if users set scheduler=condor_g in their crab.cfg and white list the site they wish to submit to. Additionally, the glidein scheduler can be used by CRAB to submit to any Condor GlideIn enabled CrabServer, such as the one at UCSD, which can then send the job on to any OSG or EGEE CMS site. GlideIn comes with Condor, so you do not have to install gLite-UI to get it.
We install CRAB on our INs using a specially created Rocks appliance. Instructions below are for command-line installs and are adapted from four (1, 2, 3, 4) gLite guides, this YAIM guide, and this CRAB guide.
On the installation node as root (su -):
- If you do not already have certificates in /etc/grid-security/certificates, you'll need to download and install the lcg-CA yum repo:
cd /etc/yum.repos.d
wget "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo"
yum install lcg-CA
- Navigate to the gLite-UI tarball repository and select your desired version of gLite-UI. These instructions are for 3.1.28-0, though they can be adapted for other releases. Download the tarballs:
mkdir /scratch/gLite
cd /scratch/gLite
wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0.tar.gz"
wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL4_i686/glite-UI-3.1.28-0-external.tar.gz"
mkdir glite-UI-3.1.28-0
ln -s glite-UI-3.1.28-0 gLite-UI
cd gLite-UI
tar zxvf ../glite-UI-3.1.28-0.tar.gz
tar zxvf ../glite-UI-3.1.28-0-external.tar.gz
- Make your site-info.def file following these (1, 2) instructions. We place our site-info.def in /scratch/gLite/gLite-UI.
- Call YAIM to install and configure gLite-UI using your site-info.def file:
./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR
- gLite-UI has problems with its PYTHONPATH. Edit /scratch/gLite/gLite-UI/external/etc/profile.d/grid-env.sh and add inside the if block:
gridpath_append "PYTHONPATH" "/scratch/gLite/gLite-UI/glite/lib"
gridpath_append "PYTHONPATH" "/scratch/gLite/gLite-UI/lcg/lib" - Navigate to the CRAB download page and select your desired version of CRAB. These instructions are for 2_6_1, though they can be adapted for other releases. Download, install, and configure:
mkdir /scratch/crab
cd /scratch/crab
wget --no-check-certificate "http://cmsdoc.cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_6_1.tgz"
tar -xzvf CRAB_2_6_1.tgz
ln -s CRAB_2_6_1 current
cd CRAB_2_6_1
./configure
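After configure finishes, a quick check that the install is usable; the environment script names below follow the usual CRAB standalone layout and are an assumption, so adjust them if your release differs:
source /scratch/gLite/gLite-UI/external/etc/profile.d/grid-env.sh
source /scratch/crab/current/crab.sh (crab.csh for c-shells; created by ./configure)
crab -h (should print the CRAB usage summary)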
User instructions for getting the gLite-UI & CRAB environment are here.
Install OSG
These instructions assume you have already installed Pacman, have a personal grid certificate, and have network mounted the big disk array to be used as the SE. The OSG installation and configuration is based on this OSG guide. OSG is built on top of services provided by VDT, so VDT documentation may be helpful to you. These instructions are for OSG 1.2 (our OSG 0.8, OSG 1.0 archives).
We install the worker-node client, the CE, and SE all on the same node (the grid node) and the CE & SE in the same directory. Therefore, we make some configuration choices along the way which might not be applicable for all sites.
- Request host certificates
- Install and configure the CE, BeStMan, and the WN client
- Start the CE & SE
- Register with the GOC
Request host certificates:
Follow these instructions. Some notes:
- Our full hostname for our grid node is hepcms-0.umd.edu
- Enter osg as the registration authority
- Enter cms as our virtual organization (VO)
- Be sure to run the second request for the http certificate
- We make a third request for an rsv certificate. Since we're going to give the rsvuser ownership of the cert, create the user account now. As root (su -) on the HN:
useradd -c "RSV monitoring user" -n rsvuser
passwd rsvuser
ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users
- Once you've received email confirmation that your certificates are approved and you've followed the instructions to retrieve your certificates, copy the files to the appropriate directories on the GN and give them the needed ownerships:
mkdir -p /etc/grid-security/http
cp hepcms-0cert.pem /etc/grid-security/hostcert.pem
cp hepcms-0key.pem /etc/grid-security/hostkey.pem
cp hepcms-0cert.pem /etc/grid-security/containercert.pem
cp hepcms-0key.pem /etc/grid-security/containerkey.pem
cp http-hepcms-0cert.pem /etc/grid-security/http/httpcert.pem
cp http-hepcms-0key.pem /etc/grid-security/http/httpkey.pem
cp rsv-hepcms-0cert.pem /etc/grid-security/rsvcert.pem
cp rsv-hepcms-0key.pem /etc/grid-security/rsvkey.pem
chown daemon:daemon /etc/grid-security/containercert.pem
chown daemon:daemon /etc/grid-security/containerkey.pem
chown daemon:daemon /etc/grid-security/http/httpcert.pem
chown daemon:daemon /etc/grid-security/http/httpkey.pem
chown rsvuser:users /etc/grid-security/rsvcert.pem
chown rsvuser:users /etc/grid-security/rsvkey.pem
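As an optional sanity check (a hedged suggestion using standard openssl, not part of the OSG instructions), confirm that each certificate landed in the right place and has the expected subject and validity dates, for example:
# Print the subject and validity dates of the installed certificates
openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject -dates
openssl x509 -in /etc/grid-security/http/httpcert.pem -noout -subject -dates
openssl x509 -in /etc/grid-security/rsvcert.pem -noout -subject -dates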
Install and configure the CE, BeStMan, and the WN client
These instructions assume /sharesoft has already been network mounted from the grid node /scratch directory. If it hasn't, instructions under the CMSSW installation give the needed steps.
- Prepare the environment
- Install the compute element
- Configure the CE
- Get the OSG environment
- Configure the grid-mapfile service
- Install & configure the storage element
- Install the worker node client
Prepare the environment: First we need to prepare for the install by creating the appropriate directories, network mounting, and changing our hostname.
- Create the appropriate directories. As root on the GN (su -):
mkdir /scratch/osg
cd /scratch/osg
mkdir wnclient-1.2 ce-1.2
ln -s wnclient-1.2 wnclient
ln -s ce-1.2 ce
ln -s ce-1.2 se
mkdir -p app/etc
chmod 777 app app/etc
mkdir /data/se/osg
chown root:users /data/se/osg
chmod 775 /data/se/osg - Have all nodes (including the GN) mount /scratch/osg on the GN as /sharesoft/osg. Edit /etc/auto.sharesoft on the HN as root (su -) and add the line:
osg grid-0-0.local:/scratch/osg - We use /tmp on the WNs as the temporary working directory for OSG jobs. If you haven't done so already, configure cron to garbage collect /tmp on all of the nodes.
- On a Rocks appliance, the command hostname outputs the local name (in our case, grid-0-0) instead of the FQHN. OSG needs hostname to output the FQHN, so we modify our configuration such that hostname prints hepcms-0.umd.edu following these instructions. Specifically:
- In /etc/sysconfig/network, replace:
HOSTNAME=grid-0-0.local
with
HOSTNAME=hepcms-0.umd.edu - In /etc/hosts, add:
128.8.164.12 hepcms-0.umd.edu - Then tell hostname to print the true FQHN:
hostname hepcms-0.umd.edu - And restart the network:
service network restart
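As a quick sanity check (not part of the linked instructions), confirm the change took effect:
# Both commands should now reflect the FQHN hepcms-0.umd.edu
hostname
grep hepcms-0 /etc/hosts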
Install the compute element: Install the CE following these instructions. Some notes:
- We install in /sharesoft/osg/ce:
cd /sharesoft/osg/ce - The pacman CE install:
pacman -get http://software.grid.iu.edu/osg-1.2:ce
outputs the messages:
INFO: The Globus-Base-Info-Server package is not supported on this platform
INFO: The Globus-Base-Info-Client package is not supported on this platform
which are safe to ignore. - We use our existing Condor installation as our jobmanager, so execute:
. setup.sh
export VDTSETUP_CONDOR_LOCATION=/opt/condor
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:Globus-Condor-Setup - We also use ManagedFork:
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:ManagedFork
$VDT_LOCATION/vdt/setup/configure_globus_gatekeeper --managed-fork y --server y - Since we run our CE & SE on the same node and various CMS utilities assume the SE is on port 8443, we need to change the ports that some CE services run on (a sed sketch for making these edits appears after this list).
- Replace 8443 in $VDT_LOCATION/tomcat/v55/conf/server.xml with 7443. The line:
enableLookups="false" redirectPort="8443" protocol="AJP/1.3"
should become:
enableLookups="false" redirectPort="7443" protocol="AJP/1.3" - Edit the file $VDT_LOCATION/apache/conf/extra/httpd-ssl.conf to change port 8443 to port 7443. The lines:
Listen 8443
RewriteRule (.*) https://%{SERVER_NAME}:8443$1
<VirtualHost _default_:8443>
ServerName www.example.com:8443
should become:
Listen 7443
RewriteRule (.*) https://%{SERVER_NAME}:7443$1
<VirtualHost _default_:7443>
ServerName www.example.com:7443
- Don't forget to run the post install:
vdt-post-install - We download certs to the local directory, which is network mounted and so readable by all nodes in the cluster:
vdt-ca-manage setupca --location local --url osg
The local directory /etc/grid-security/certificates on all nodes which need access to certs should point to the CE $VDT_LOCATION/globus/share/certificates. E.g., as root (su -) on the GN (needed by OSG services) and interactive nodes (needed by CRAB):
mkdir /etc/grid-security
cd /etc/grid-security
ln -s /sharesoft/osg/ce/globus/share/certificates
The WNs will get certificates by following the symlinks we create in the wnclient directory (installation instructions for WN client below). They do not assume that certificates are at /etc/grid-security/certificates. - *Note: This step may no longer be necessary in OSG 1.2. RSV needs to run in the condor-cron queue instead of the global condor pool because it has many lightweight jobs running constantly. Edit ~rsvuser/.cshrc and add:
source /sharesoft/osg/ce/setup.csh
source $VDT_LOCATION/vdt/etc/condor-cron-env.csh
and edit ~rsvuser/.bashrc and add:
. /sharesoft/osg/ce/setup.sh
. $VDT_LOCATION/vdt/etc/condor-cron-env.sh
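For reference, the 8443-to-7443 port edits described above can also be made non-interactively; this is a hedged sed sketch (it keeps .bak backups, and you should diff the results to confirm only the intended occurrences changed):
# Replace 8443 with 7443 in the Tomcat and Apache configuration, keeping backups
sed -i.bak 's/8443/7443/g' $VDT_LOCATION/tomcat/v55/conf/server.xml
sed -i.bak 's/8443/7443/g' $VDT_LOCATION/apache/conf/extra/httpd-ssl.conf
diff $VDT_LOCATION/tomcat/v55/conf/server.xml.bak $VDT_LOCATION/tomcat/v55/conf/server.xml
diff $VDT_LOCATION/apache/conf/extra/httpd-ssl.conf.bak $VDT_LOCATION/apache/conf/extra/httpd-ssl.conf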
Configure the CE: Configure the CE following these instructions. Our config.ini is available here for reference. Note that in OSG 1.2, config.ini is placed in the $VDT_LOCATION/osg/etc directory instead of $VDT_LOCATION/monitoring.
Get the OSG environment: We also have users get the OSG environment on login by editing the .bashrc & .cshrc skeleton files. These will be copied to each new user's /home directory. Existing users (such as cmssoft) will have to add the source commands to their ~/.bashrc & ~/.cshrc files. As root (su -) on the HN:
- Add to /etc/skel/.bashrc:
. /sharesoft/osg/ce/setup.sh - Add to /etc/skel/.cshrc:
source /sharesoft/osg/ce/setup.csh
Configure the grid-mapfile service: We use a grid-mapfile for user authentication. OSG strongly recommends the use of GUMS, however, we encountered great difficulty running GUMS on our Rocks HN. Follow these instructions to configure the grid-mapfile service. Some notes:
- The sudo-example.txt file is located in $VDT_LOCATION/osg/etc.
- To edit /etc/sudoers:
visudo
a
Copy and paste changes, being careful to replace symlinks with full paths.
Esc
:wq!
- The VOs we support can be limited by editing the file $VDT_LOCATION/edg/etc/edg-mkgridmap.conf and removing all lines but those for the mis, uscms01, and ops users. This file can be overwritten on future pacman updates, so check it each time.
- The accounts for each supported VO need to be made. On the HN as root (su -):
useradd -c "Monitoring information service" -n mis -s /bin/true
useradd -c "CMS grid jobs" -n uscms01 -s /bin/true
useradd -c "Monitoring from ops" -n ops -s /bin/true
ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users
Setting their shell to true is a security measure, as these user accounts should never actually ssh in. - The grid mapfile file can be remade at any time by executing:
$VDT_LOCATION/edg/sbin/edg-mkgridmap - The http cert will be used by the CE to gather information. It needs to be mapped to a user account following these instructions. The DN->user mapping we add to our grid-mapfile-local is:
"/DC=org/DC=doegrids/OU=Services/CN=http/hepcms-0.umd.edu" uscms01 - The RSV cert will also be used:
"/DC=org/DC=doegrids/OU=Services/CN=rsv/hepcms-0.umd.edu" rsvuser - If the CMSSW environment is ready and you wish to have Bockjoo perform automatic installs, map his DNs to the cmssoft account:
"/DC=org/DC=doegrids/OU=People/CN=Bockjoo Kim (UFlorida T2 Service) 606361" cmssoft
"/DC=org/DC=doegrids/OU=People/CN=Bockjoo Kim 740786" cmssoft
(these DNs were found by doing a grep on the existing grid-mapfile)
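Once the VO accounts and local mappings above are in place, a hedged check (the generated grid-mapfile is commonly /etc/grid-security/grid-mapfile, but verify the location against your own configuration) is to regenerate the mapfile and grep for the expected accounts:
# Regenerate the grid-mapfile and confirm the expected mappings are present
$VDT_LOCATION/edg/sbin/edg-mkgridmap
grep -E "uscms01|rsvuser|cmssoft" /etc/grid-security/grid-mapfile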
Install & configure the storage element: Install BeStMan-Gateway following these instructions. Some notes:
- We install in our /sharesoft/osg/se directory, which is a symlink to our ce installation directory. If you're working in a fresh shell, be sure to source the existing OSG installation:
cd /sharesoft/osg/se
. setup.sh - We use the following configuration settings:
vdt/setup/configure_bestman --server y \
--user daemon \
--cert /etc/grid-security/containercert.pem \
--key /etc/grid-security/containerkey.pem \
--http-port 7070 \
--https-port 8443 \
--globus-tcp-port-range 20000,25000 \
--enable-gateway \
--with-allowed-paths "/tmp;/home;/data" \
--with-transfer-servers gsiftp://hepcms-0.umd.edu
If you call configure_bestman more than once, it will issue the message:
find: /sharesoft/osg/se-1.2/bestman/bin/sharesoft/osg/se-1.2/bestman/sbin/sharesoft/osg/se-1.2/bestman/setup: No such file or directory
This message can be safely ignored. - Don't forget to edit the sudoers file to give daemon the needed permissions:
visudo
a
Copy and paste the needed lines
Esc
:wq! - The certificate updater service is already configured to run via the CE, so we don't need to take any special steps for the SE. This is because we installed the SE on the same node and in the same directory as the CE.
- We use the gsiftp server running via the CE software, so don't need any special configuration options for the SE. This is because we installed the SE on the same node as the CE.
Install the worker node client: Now install the worker-node client as root (su -) on the GN in a fresh shell in /sharesoft/osg/wnclient following these instructions. Some notes:
- Because we install the WN client on the same network mount as the CE, we have the CE handle certificates. This is option 2 in the Twiki.
- The WN client documentation on the OSG ReleaseDocumentation Twiki is out of date as of August 16, 2009. So complete instructions are presented here:
- Install:
cd /sharesoft/osg/wnclient
pacman -allow trust-all-caches -get http://software.grid.iu.edu/osg-1.2:wn-client
You can safely ignore the message:
INFO: The Globus-Base-Info-Client package is not supported on this platform - Get the WN client environment:
. setup.sh - Tell the WN client that we will store certificates in the local directory, specifically /sharesoft/osg/wnclient/globus/TRUSTED_CA:
vdt-ca-manage setupca --location local --url osg - Since we will run our CE on the same node, point the WN client TRUSTED_CA directory to the CE TRUSTED_CA directory:
rm globus/TRUSTED_CA
ln -s /sharesoft/osg/ce/globus/TRUSTED_CA globus/TRUSTED_CA - The original WN client certificate directory can be removed if desired:
rm globus/share/certificates
rm -r globus/share/certificates-1.9
- Since the WN client is on the same node as the CE, no services need to be enabled or turned on. It is purely a passive software directory from which WNs can grab binaries and configuration.
Start the CE & SE
As root (su -) on the GN:
- Start the OSG CE & SE:
cd /sharesoft/osg/ce
. setup.sh
vdt-control --on
This starts all the services for both the CE & SE because we installed them in the same directory. - You can perform a series of simple tests to see if your CE has basic functionality. Login to any user account and:
source /sharesoft/osg/ce/setup.csh
grid-proxy-init
cd /sharesoft/osg/ce/verify
./site_verify.pl - The CEmon log is kept at $VDT_LOCATION/glite/var/log/glite-ce-monitor.log.
- The GIP logs are kept at $VDT_LOCATION/gip/var/logs.
- globus & gridftp logs are kept in $GLOBUS_LOCATION/var and $GLOBUS_LOCATION/var/log.
- The BeStMan log is kept in $VDT_LOCATION/vdt-app-data/bestman/logs/event.srm.log.
- Results of the RSV probes will be visible at https://hepcms-0.umd.edu:7443/rsv in 15-30 mins. Further information can be found in the CE $VDT_LOCATION/osg-rsv/logs/probes.
- You can force RSV probes to run immediately following these instructions.
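The storage element can be exercised separately; this is a hedged sketch (it assumes the srm client tools and globus-url-copy from the OSG install are on your PATH after sourcing setup.sh, and the file names are hypothetical):
source /sharesoft/osg/ce/setup.sh
grid-proxy-init
# Ping the BeStMan SRM endpoint
srm-ping srm://hepcms-0.umd.edu:8443/srm/v2/server
# Simple gsiftp transfer into one of the allowed paths
echo "srm test" > /tmp/srmtest.txt
globus-url-copy file:///tmp/srmtest.txt gsiftp://hepcms-0.umd.edu/data/se/osg/srmtest.txt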
After starting the CE for the first time, the file /sharesoft/osg/app/etc/grid3-locations.txt is created. This file is used to publish VO software tags and should be edited every time a new VO software release is installed or removed. If CMSSW is installed (instructions below are repeated in the CMSSW installation):
- Add a link to the CMSSW installation in the osg-app directory:
cd /sharesoft/osg/app
mkdir cmssoft
chmod 777 cmssoft
chown cmssoft:users cmssoft
- Give cmssoft ownership of the release file:
chown cmssoft:users /sharesoft/osg/app/etc/grid3-locations.txt - As cmssoft (su - cmssoft), create the needed symlink in the OSG APP directory to CMSSW:
cd /sharesoft/osg/app/cmssoft
ln -s /sharesoft/cmssw cms - As cmssoft (su - cmssoft), inform BDII which versions of CMSSW are installed and that we have the slc4_ia32_gcc345 environment. Edit /sharesoft/osg/app/etc/grid3-locations.txt to include the lines:
VO-cms-slc4_ia32_gcc345 slc4_ia32_gcc345 /sharesoft/cmssw
VO-cms-CMSSW_X_Y_Z CMSSW_X_Y_Z /sharesoft/cmssw
(modify X_Y_Z and add a new line for each release of CMSSW installed)
Register with the Grid Operations Center (GOC):
This should be done only once per site (we have already done it). Registration is done at the OSG Information Management (OIM) web portal. Instructions for registration can be found here; you'll need to register yourself and a resource as a new site. We used an older registration process which is no longer in use, but for reference, here are the options we selected for resource registration:
- Facility: My Facility Is Not Listed (now that we have registered, we select University of Maryland for any new resources we might add later)
- Site: My Site Is Not Listed (again, now that we have registered, we select umd-cms)
- Resource Name: umd-cms
- Resource Services: Compute Element, Bestman-Xrootd Storage Element
- Fully Qualified Domain Name: hepcms-0.umd.edu
- Resource URL: http://hep-t3.physics.umd.edu
- OSG Grid: OSG Production Resource
- Interoperability: Select WLCG Interoperability BDII (Published to WLCG); do not select WLCG Interoperability Monitoring
(Note: We initially opted to not try to pass SAM tests. At some point, our site was added to CMS SAM tests and we now pass. It's not clear if this option should be selected to have CMS SAM begin testing your site right away.) - GOC Logging: Do not select Publish Syslogng
- Resource Description: Tier-3 computing center. Priority given to local users, but opportunistic use by CMS VO allowed.
Once registration has completed, monitoring info will be here.
Install PhEDEx
We configure PhEDEx to use srm calls directly instead of FTS. FTS is the service most commonly used by Tier 1 and Tier 2 sites because it tends to scale better. FTS requires gLite, which may conflict with an existing CRAB gLite-UI install, so in that case be sure to install PhEDEx on a different node. Regardless, our current installation of PhEDEx does not use gLite. We install PhEDEx on grid-0-0. These instructions are adapted from these (1, 2, 3, 4, 5) PhEDEx guides.
These instructions assume you have already done all the major tasks except for the CRAB install. Specifically, you need to have configured the big disk, created the grid node (via Rocks appliance) and configured its external network connection, and installed OSG. You will also need to have Kerberos configured, CVS installed and configured, and CMSSW installed.
Site registration
Site registration is done only once for a site. These instructions are based on this PhEDEx guide; be sure to consult it for the most recent details. You can register your site in SiteDB prior to OSG GOC registration; however, once OSG GOC registration is complete, you should change your SAM name to your OSG GOC name by filing a new Savannah ticket.
- Create a Savannah ticket with your user public key (usercert.pem) and with the information:
- Site name: UMD
- CMS name: T3_US_UMD
- SAM name: umd-cms (our OSG GOC registration name)
- City/Country: College Park, MD, USA
- Site tier: Tier 3
- SE host: hepcms-0.umd.edu
- SE kind: disk
- SE technology: BeStMan
- CE host: hepcms-0.umd.edu
- Associate T1: FNAL
- Grid type: OSG
- Data manager: Marguerite Tonjes
- PhEDEx contact: Marguerite Tonjes
- Site admin: Marguerite Tonjes
- Site executive: Nick Hadley
- Email the persons listed here and ask them to add our site to the PhEDEx database, including a link to the Savannah ticket (CERN phonebook).
- Once someone has responded to say UMD has been put into SiteDB, go to https://cmsweb.cern.ch/sitedb/sitedb/sitelist/
- Log in with your CERN hypernews user name and password
- Under Tier 3 centres, click on the T3_US_UMD link
- Click on "Edit site information" and specify OSG as our Grid Middleware, our site home page as http://hep-t3.physics.umd.edu and our site logo URL as http://hep-t3.physics.umd.edu/images/umd-logo.gif
- We can also add/edit user information by clicking on "Edit site contacts":
- Click on "edit" to edit an existing user's info
- Click on "Add a person with a hypernews account to site" to add someone new
- Then click on the first letter of the user's last name. Note that many users are listed by their middle name instead of their last.
- Find the user in the list, and click "edit"
- A new page will appear. Click on appropriate values ("Site Admin", "Data Manager",etc.) in the last row of the new page (for the Tier 3), and click "Edit these details" to save.
- Under Site Configuration, select "Edit site configuration":
- CE FQDN: hepcms-0.umd.edu
- SE FQDN: hepcms-0.umd.edu
- PhEDEx node: T3_US_UMD
- GOCDB ID: leave blank
- Install development CMSSW releases?: Do not check
- Site installs software manually?: Check
Install on the GN
These instructions are for PhEDEx 3.2.9, though they can be adapted for later releases.
Prepare for the PhEDEx install. On the HN as root (su -):
- Create the PhEDEx user:
useradd -c "PhEDEx" -n phedex -s /bin/bash
passwd phedex
ssh-agent $SHELL
ssh-add
rocks sync config
rocks sync users - Change ownership of the directory on /data which PhEDEx will use:
chown phedex:users /data/se/store
chmod 775 /data/se/store - And as root on the GN:
mkdir /localsoft/phedex
chown phedex:users /localsoft/phedex
As phedex (su - phedex) on the GN:
- Set up the environment:
cd /localsoft/phedex
mkdir 3.2.9
ln -s 3.2.9 current
cd 3.2.9 - Install PhEDEx following these instructions. Some notes:
- Get the CMSSW libraries in your environment before calling the bootstrap script:
export VO_CMS_SW_DIR=/sharesoft/cmssw
. $VO_CMS_SW_DIR/cmsset_default.sh
- We set myarch=slc4_amd64_gcc345
- apt-cache search won't work until after calling
source $sw/$myarch/external/apt/*/etc/profile.d/init.sh
which will only work after you set the sw & myarch environment variables, as well as downloading and executing the bootstrap script. - We set version=3_2_9
- We use the srm client already installed and network mounted on the OSG CE (we tell PhEDEx to grab the environment in the ConfigPart.Common file).
- We use the JDK already installed and network mounted on the OSG CE. No special modifications to PhEDEx to use it were required.
- Configure PhEDEx following these (1, 2) instructions. Examples of site configuration can be found here. Our local site configuration can be found here.
Some notes:
- Our site name is T3_US_UMD, so our configuration directories are
$PHEDEX_BASE/SITECONF/T3_US_UMD/PhEDEx
and
$PHEDEX_BASE/SITECONF/T3_US_UMD/JobConfig - We had to modify more than just storage.xml, so be sure to check all the files in the directories for differences from the default templates.
- The JobConfig directory is not actually needed by PhEDEx, it's needed by CMSSW. We choose to put it in our PhEDEx installation area as well (it's harmless).
- CMSSW jobs also need the files in your SITECONF directory. Copy the entire SITECONF directory to the $CMS_PATH directory:
su -
cp -r /localsoft/phedex/current/SITECONF /sharesoft/cmssw/.
cp -r /sharesoft/cmssw/SITECONF/T3_US_UMD /sharesoft/cmssw/SITECONF/local
chown -R cmssoft:users /sharesoft/cmssw/SITECONF
logout
Some sites use different storage.xml files in their $PHEDEX_BASE and $CMS_PATH directories to handle CRAB stage-out of files without a locally installed storage element. Since we have a storage element, ours are the same. - After starting services (detailed in the next section) for the first time, you can test your storage.xml file by:
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
Test srmv2 mapping from LFN to PFN:
/localsoft/phedex/current/sw/slc4_amd64_gcc345/cms/PHEDEX/PHEDEX_3_2_0/Utilities/TestCatalogue -c storage.xml -p srmv2 -L /store/testfile
Test srmv2 mapping from PFN to LFN:
/localsoft/phedex/current/sw/slc4_amd64_gcc345/cms/PHEDEX/PHEDEX_3_2_0/Utilities/TestCatalogue -c storage.xml -p srmv2 -P srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/data/se/store/testfile
Other transfer types can be tested by changing the protocol tag srmv2 to direct, srm, or gsiftp and changing the PFN or LFN argument to match. PhEDEx services don't need to be running to do these tests, but the first time PhEDEx is started, it creates some of the directories needed for this test.
- Submit a Savannah ticket for a CVS space under /COMP/SITECONF named T3_US_UMD. Once you receive the space, upload your site configuration to CVS:
/usr/kerberos/bin/kinit -5 username@CERN.CH
cvs co COMP/SITECONF/T3_US_UMD
cp -r /localsoft/phedex/current/SITECONF/T3_US_UMD/* COMP/SITECONF/T3_US_UMD/.
cd COMP/SITECONF/T3_US_UMD
cvs add PhEDEx
cvs add PhEDEx/*
cvs commit -R -m "T3_US_UMD PhEDEx site configuration" PhEDEx - Once your initial registration request is satisfied, you will receive three emails titled "PhEDEx authentication role for Prod (Debug, Dev)/UMD." Copy and paste the commands in the email to the command line. Copy the text output for each into the file /localsoft/phedex/current/gridcert/DBParam. Each text output should look something like (exact values removed for security):
Section Prod/UMD
Interface Oracle
Database db_not_shown_here
AuthDBUsername user_not_shown_here
AuthDBPassword LettersAndNumbersNotShownHere
AuthRole role_not_shown_here
AuthRolePassword LettersAndNumbersNotShownHere
ConnectionLife 86400
LogConnection on
LogSQL off
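Since DBParam contains database credentials, it is prudent (a hedged suggestion, not part of the linked instructions) to make it readable only by the phedex user:
# Restrict DBParam so only the phedex user can read it
chmod 600 /localsoft/phedex/current/gridcert/DBParam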
Get proxy & start services
After reboot of the grid node, the grid certificate and proxy should still be valid, but PhEDEx services aren't configured to start automatically. On the grid node:
- Copy your personal usercert.pem and userkey.pem grid certificate files into ~phedex/.globus and give the phedex user ownership:
chown phedex:users ~phedex/.globus/* - As phedex, create your grid proxy:
voms-proxy-init -voms cms -hours 350 -out /localsoft/phedex/current/gridcert/proxy.cert
Be sure to make note of when the proxy will expire and log on to renew it before then. Some sites will not accept proxies older than a week, so if you have many links, you will probably need to renew your proxy every week. - Now start the services. To be extra safe, each service should be started in a new shell, though in most cases, executing the following in sequence should be OK:
- Start the Dev service instance:
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Dev environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Dev start
This service can be stopped by changing the command start to stop. - Start the Debug service instance:
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Debug environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Debug start
This service can be stopped by changing the command start to stop. - Start the Prod service instance:
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Prod start
This service can be stopped by changing the command start to stop.
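To confirm the agents actually came up, a hedged check (the log file names are assumptions based on our configuration; the Dev and Debug instances can be checked the same way):
# Recently modified logs indicate the Prod agents are running
ls -lt /localsoft/phedex/current/Prod_T3_US_UMD/logs/
tail -n 20 /localsoft/phedex/current/Prod_T3_US_UMD/logs/download-srm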
Clean Logs:
PhEDEx does not clean up its own logs. The first time you start the PhEDEx services, it will create the log files. We use logrotate in cron to clean them monthly, as well as to retain two months of old logs. After starting PhEDEx services at least once on the phedex node:
- Create the backup directories:
mkdir /localsoft/phedex/current/Dev_T3_US_UMD/logs/old
mkdir /localsoft/phedex/current/Debug_T3_US_UMD/logs/old
mkdir /localsoft/phedex/current/Prod_T3_US_UMD/logs/old
- Create the file /home/phedex/phedex.logrotate with the contents (this logrotate guide was helpful):
rotate 2
monthly
olddir old
nocompress
/localsoft/phedex/current/Dev_T3_US_UMD/logs/* {
prerotate
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Dev environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Dev stop
endscript
postrotate
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Dev environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Dev start
endscript
}
/localsoft/phedex/current/Debug_T3_US_UMD/logs/* {
prerotate
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Debug environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Debug stop
endscript
postrotate
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Debug environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Debug start
endscript
}
/localsoft/phedex/current/Prod_T3_US_UMD/logs/* {
prerotate
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Prod stop
endscript
postrotate
cd /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx
eval `/localsoft/phedex/current/PHEDEX/Utilities/Master -config /localsoft/phedex/current/SITECONF/T3_US_UMD/PhEDEx/Config.Prod environ`
/localsoft/phedex/current/PHEDEX/Utilities/Master -config Config.Prod start
endscript
}
- Run logrotate from the command line to check that it works:
/usr/sbin/logrotate -f /home/phedex/phedex.logrotate -s /home/phedex/logrotate.state - As root (su -), automate by editing /var/spool/cron/phedex and adding the line:
52 01 * * 0 /usr/sbin/logrotate /home/phedex/phedex.logrotate -s /home/phedex/logrotate.state
This will direct logrotate to run every Sunday at 1:52 as the user phedex. - Additionally, the Prod download-remove agent doesn't clean up its job logs. As root, edit /var/spool/cron/phedex and add the line:
02 00 * * 0 find /localsoft/phedex/current/Prod_T3_US_UMD/state/download-remove/*log -mtime +7 -type f -exec rm -f {} \;
Commission links:
To download data using PhEDEx, a site must have a Production link originating from one of the nodes hosting the dataset. To create each link, sites must go through a LoadTest/link commissioning process. Our Production links to download to our site are listed here. These instructions are adapted from this Twiki.
- The first link you'll want to commission is from the T1_US_FNAL_Buffer. To commission from FNAL, send a request to begin the link commissioning process to hn-cms-ddt-tf@cern.ch. To commission links from other sites, contact the PhEDEx admins for that site as listed in SiteDB (requires Firefox). Ask them if a link is OK and if so, to please create a LoadTest.
- For non-FNAL sites, create a Savannah ticket requesting that the Debug link be made from the other site to T3_US_UMD. Select the data transfers category, set the severity as 3-Normal, the privacy as public and T3_US_UMD as the site.
- PhEDEx or originating-site admins may create the transfer request for you. If they do, follow the link in the PhEDEx transfer request email sent to you to approve the request. If they do not, create the transfer request yourself:
- Go to the PhEDEx LoadTest injection page and under the link "Show Options," click the "Nodes Shown" tab, then select the source node.
- Find T3_US_UMD in the "Destination node" column and copy the "Injection dataset" name.
- Create a transfer request and copy the dataset name into the "Data Items" box. Select T3_US_UMD as the destination. The DBS is typically LoadTest07, but some sites may create the subscription under LoadTest. You will receive an error if you select the wrong one - simply go back and select the other DBS. Leave the drop down menus as-is (replica, growing, low priority, non-custodial, undefined group). Enter as a comment something to the effect of "Commissioning link from T1_US_FNAL_Buffer to T3_US_UMD," then click the "Submit Request" button.
- As administrator for the site, you should be able to approve the request right away, simply select the "Approve" radio button and submit the change.
- Files created by load tests should be removed shortly after they are created.
- To use a cron job that will remove LoadTest files on regular intervals, login to the GN as root (su -), edit /var/spool/cron/root and add the lines:
07 * * * * find /data/se/store/PhEDEx_LoadTest07 -mmin +180 -type f -exec rm -f {} \;
37 * * * * find /data/se/store/PhEDEx_LoadTest07 -depth -type d -mmin +180 -exec rmdir --ignore-fail-on-non-empty {} \;
This will remove three hour old PhEDEx load test files every hour at the 7th minute. - Or you can configure the Debug agent to delete files immediately after download. To do this, base your PhEDEx configuration on the T3_US_FNALXEN configuration.
- Once load tests have been successful at a rate of >5 MB/sec for one day, the link qualifies as commissioned and PhEDEx admins will create the Production link. If PhEDEx admins don't take note of the successful tests within a week, you can send a reminder to hn-cms-ddt-tf@cern.ch or reply to the Savannah ticket that the link passes commissioning criteria and that you'd like the Prod link to be created.
Install/configure other software
Software which must be usable by the worker nodes should be installed in the head node /export/apps directory. /export/apps is cross-mounted across all nodes and is visible to all nodes as the /share/apps directory.
RPMforge:
RPMforge helps to resolve package dependencies when installing new software. It enables RPMforge repositories in smart, apt, yum, and up2date. We use yum. Packages are installed both on the HN and on the WNs, so RPMforge needs to be installed for both. These instructions are adapted from RPMforge and Rocks.
- To install RPMforge on the HN:
cd /home/install/contrib/4.3/x86_64/RPMS
wget "http://packages.sw.be/rpmforge-release/rpmforge-release-0.3.6-1.el4.rf.x86_64.rpm"
rpm -Uhv rpmforge-release-0.3.6-1.el4.rf.x86_64.rpm - To install RPMforge on the WNs:
Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and add the following line:
<package>rpmforge-release</package>
Make a new Rocks kickstart distribution:
cd /home/install
rocks-dist dist
Reinstall the WNs.
xemacs/emacs:
Rocks does not install xemacs on any node, nor emacs on the WNs. The installation instructions below assume that you have installed RPMforge on the HN to resolve package dependencies. Instructions to install on the WNs are adapted from this Rocks guide. The interactive nodes and grid nodes install emacs via <package type="meta"> tags in their Kickstart files, which install software bundles.
- Install xemacs on the HN:
cd /home/install/contrib/4.3/x86_64/RPMS
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/xemacs-common-21.4.15-10.EL.1.x86_64.rpm"
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/xemacs-21.4.15-10.EL.1.x86_64.rpm"
yum localinstall xemacs-common-21.4.15-10.EL.1.x86_64.rpm
yum localinstall xemacs-21.4.15-10.EL.1.x86_64.rpm - Install xemacs and emacs on the WNs:
"http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/apel-xemacs-10.6-5.noarch.rpm"
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/FreeWnn-libs-1.10pl020-5.x86_64.rpm"
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/Canna-libs-3.7p3-7.EL4.x86_64.rpm"
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/xemacs-sumo-20040818-2.noarch.rpm"
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/emacs-common-21.3-19.EL.4.x86_64.rpm"
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/emacs-21.3-19.EL.4.x86_64.rpm"
Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml by adding the following <package> lines:
<package>Canna-libs</package>
<package>FreeWnn-libs</package>
<package>apel-xemacs</package>
<package>xemacs-sumo</package>
<package>xemacs-common</package>
<package>xemacs</package>
<package>emacs-common</package>
<package>emacs</package>
Create the new Rocks kickstart distribution:
cd /home/install
rocks-dist dist
Re-shoot the WNs.
It is not entirely clear if all these rpm files really must be downloaded (they should come with the SL4.5 release), but the instructions above have been verified to work.
Pacman:
We install Pacman on the HN and GN. Pacman 3.28 or later is required for the BeStMan release which comes packaged with OSG 1.2. As root (su -) on each node:
- Download the latest Pacman:
wget "http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-latest.tar.gz" - Unzip to /usr:
tar xzvf pacman-latest.tar.gz -C /usr
- Source the setup script for the first time:
cd /usr/pacman-x.xx
. setup.sh
- Edit ~root/.bashrc to include the source:
. /usr/pacman-x.xx/setup.sh
Kerberos:
These instructions enable getting kerberos tickets from FNAL and from CERN. User instructions for kerberos authentication are given here.
Configure Kerberos on the HN. As root (su -) on the HN:
- To enable FNAL tickets, save this file as /etc/krb5.conf.
- To enable CERN tickets, add to /etc/krb.conf:
CERN.CH
CERN.CH afsdb1.cern.ch
CERN.CH afsdb3.cern.ch
CERN.CH afsdb2.cern.ch - And add to /etc/krb.realms:
.cern.ch CERN.CH - Configure ssh to use Kerberos tickets:
Make the appropriate file writeable:
chmod +w /etc/ssh/ssh_config
Add the lines to /etc/ssh/ssh_config:
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes
Remove writeability:
chmod -w /etc/ssh/ssh_config
Restart the ssh service:
/etc/init.d/sshd restart - Add to /etc/skel/.cshrc:
# Kerberos
alias kinit_fnal '/usr/kerberos/bin/kinit -A -f'
alias kinit_cern '/usr/kerberos/bin/kinit -5'
- Add to /etc/skel/.bashrc and to ~root/.bashrc:
# Kerberos
alias kinit_fnal='/usr/kerberos/bin/kinit -A -f'
alias kinit_cern='/usr/kerberos/bin/kinit -5'
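For reference, a usage example once the aliases are in place (the username is a placeholder):
# Get FNAL and CERN tickets, then list them
kinit_fnal username@FNAL.GOV
kinit_cern username@CERN.CH
klist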
Configure Kerberos on the WNs. As root (su -) on the HN:
- Copy krb5.conf to where it can be served from the HN during WN install:
cp /etc/krb5.conf /home/install/contrib/4.3/x86_64/RPMS/krb5.conf - Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and add to the <post> section:
wget -P /etc http://<var name="Kickstart_PublicHostname"/>/install/rocks-dist/lan/x86_64/RedHat/RPMS/krb5.conf
<file name="/etc/ssh/ssh_config" mode="append">
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes
</file>
<file name="/etc/krb.conf" mode="append">
CERN.CH
CERN.CH afsdb1.cern.ch
CERN.CH afsdb3.cern.ch
CERN.CH afsdb2.cern.ch
</file>
<file name="/etc/krb.realms" mode="append">
.cern.ch CERN.CH
</file> - Create the new Rocks distribution:
cd /home/install
rocks-dist dist - Reinstall the WNs
CVS:
CVS needs to be configured to automatically contact the CMSSW repository using Kerberos-enabled authentication. A Kerberos-enabled CVS client is already installed on the HN, but the WNs use a version of CVS distributed by Rocks, which needs to be updated. While this is a one-time install, we believe it must be done after at least one version of CMSSW has been installed on your system. At the very least, it must be done after the one-time CMSSW install commands. Of course, Kerberos authentication to CERN must also be configured. These instructions also assume that RPMforge is installed on the WNs. These instructions are based on this FAQ.
On the GN as cmssoft (su - cmssoft), install the CMSSW CVS configuration package:
source /sharesoft/cmssw/slc4_ia32_gcc345/external/apt/<version>/etc/profile.d/init.csh
apt-get update
apt-get install cms+cms-cvs-utils+1.0-cms
On the HN as root (su -):
- Download the Kerberos-enabled CVS client for the other nodes:
cd /home/install/contrib/4.3/x86_64/RPMS
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/cvs-1.11.17-9.RHEL4.x86_64.rpm" - Install the Kerberos-enabled CVS on the other nodes. Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and add to the <post> section:
wget "http://<var name="Kickstart_PublicHostname"/>/install/rocks-dist/lan/x86_64/RedHat/RPMS/cvs-1.11.17-9.RHEL4.x86_64.rpm"
yum -y localinstall cvs-1.11.17-9.RHEL4.x86_64.rpm - Create the new distribution:
cd /home/install
rocks-dist dist - Reinstall the non-HN nodes.
User instructions for CVS checkout are given here.
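For reference, a hedged sketch of a Kerberos-authenticated checkout from the command line (the CVSROOT string, release tag, and package name are illustrative assumptions; the linked user instructions give the exact values):
# Get a CERN ticket, point CVS at the CMSSW repository, and check out a package
/usr/kerberos/bin/kinit -5 username@CERN.CH
export CVSROOT=':gserver:cmssw.cvs.cern.ch:/cvs_server/repositories/CMSSW'
cvs co -r CMSSW_2_2_13 PhysicsTools/PatAlgos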
Subversion:
We already had RPMforge installed (to resolve dependencies) at the time we installed subversion. A dependency resolver such as RPMforge may be required to install subversion.
yum install subversion
cron garbage collection:
These instructions provide the cron and Rocks kickstart cron commands to add garbage collection of /tmp for all nodes.
First, create a cron job on the HN and GN. As root (su -) on each, edit /var/spool/cron/root and add the lines:
6 * * * * find /tmp -mtime +1 -type f -exec rm -f {} \;
36 2 * * 6 find /tmp -depth -mtime +7 -type d -exec rmdir --ignore-fail-on-non-empty {} \;
This will remove day-old files in /tmp on the HN & GN every hour on the 6th minute and week-old empty directories in /tmp every Saturday at 2:36.
Now create the cron job on the WNs & INs:
- Edit /home/install/site-profiles/4.3/nodes/extend-compute.xml and place the following commands inside the <post></post> brackets:
<!-- Create a cron job that garbage-collects /tmp -->
<file name="/var/spool/cron/root" mode="append">
6 * * * * find /tmp -mtime +1 -type f -exec rm -f {} \;
36 2 * * 6 find /tmp -depth -mtime +7 -type d -exec rmdir --ignore-fail-on-non-empty {} \;
</file> - Create the new distribution:
cd /home/install
rocks-dist dist - Re-install the WNs & INs
Condor
We install Condor using the Rocks roll, then modify it to add Condor_G as a part of the OSG installation. To be safe, you should configure condor after you've installed OSG. These instructions are based on the very complete guide provided by Condor.
First we handle two issues: (1) there is a domain mismatch between internal and external hostnames from Rocks, and (2) CMSSW jobs cannot be evicted and resumed without loss of compute cycles. On the HN as root (su -):
- Edit /opt/condor/etc/condor_config.local and add the lines:
TRUST_UID_DOMAIN = True
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
CLAIM_WORKLIFE = 300
WANT_SUSPEND = True
SUSPEND = ( (CpuBusyTime > 2 * $(MINUTE)) \
    && $(ActivationTimer) > 300 )
CONTINUE = $(CPUIdle) && ($(ActivityTimer) > 10)
PREEMPT = False
- Replace the original Rocks Condor roll xml file that creates the condor_config.local file on the other nodes:
cp /home/install/rocks-dist/lan/x86_64/build/nodes/condor-client.xml /home/install/site-profiles/4.3/nodes/replace-condor-client.xml - Edit /home/install/site-profiles/4.3/nodes/replace-condor-client.xml and add the following inside the cat of /opt/condor/etc/condor_config.local (between lines with CONFEOF):
TRUST_UID_DOMAIN = True
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
CLAIM_WORKLIFE = 300
WANT_SUSPEND = True
SUSPEND = ( (CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 300 )
CONTINUE = $(CPUIdle) && ($(ActivityTimer) > 10)
PREEMPT = False
Additionally, the interactive and grid nodes should not actually service condor jobs; they should only submit them. We fix this by copying the entire <file name="/etc/rc.d/rocksconfig.d/post-90-condor-client"> section of replace-condor-client.xml to the interactive.xml and grid.xml Kickstart files and replacing the CondorConf tag "-t se" with "-t s".
Now we need to restart services, create the new Rocks distribution, and reinstall all the non-HN nodes. As root (su -) on the HN:
- Restart the Condor service on the HN:
/etc/init.d/rocks-condor restart - If OSG is installed, the OSG condor-devel service (as well as RSV, which uses condor-devel) needs to be restarted:
ssh grid-0-0
cd /sharesoft/osg/ce
vdt-control --off osg-rsv condor-devel
vdt-control --on condor-devel osg-rsv - Create the new Rocks distribution:
cd /home/install
rocks-dist dist - Reinstall the other nodes.
We create a simple condor monitoring script that will route output to the web server, to be viewed by users:
- Create the file /root/condor-status-script.sh with the contents:
#!/bin/bash
. /root/.bashrc
OUTPUT=/var/www/html/condor_status.txt
echo -e " \n\n" >$OUTPUT
echo -e "As of `date` \n">>$OUTPUT
/opt/condor/bin/condor_status -submitters >>$OUTPUT
/opt/condor/bin/condor_userprio -all >>$OUTPUT
/opt/condor/bin/condor_status -run >>$OUTPUT - Run it every 10 minutes by editing /var/spool/cron/root and adding the line:
1,11,21,31,41,51 * * * * /root/condor-status-script.sh - Output will be here.
Condor keeps logs in /var/opt/condor/log, StartLog & StarterLog are particularly useful. Generally, the most information can be found on the node which serviced (not submitted) the job you are attempting to get info on.
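To identify which node serviced a particular job before digging through its logs, a hedged sketch (the job ID 123.0 is hypothetical):
# Show where the job ran; then inspect StartLog/StarterLog on that node
/opt/condor/bin/condor_history -l 123.0 | grep -i RemoteHost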
Backup critical files
The files below should be backed up to a secure non-cluster location. Marguerite Tonjes currently maintains the backup of these files. Users can use /data as a backup location, but this is not a sufficient backup location for these critical admin files. Note that many of these files are readable only by root.
- In ~root:
- network-ports.txt
- configure-external-network.sh
- security.txt
- hepcms-0cert.pem
- hepcms-0key.pem
- http-hepcms-0cert.pem
- http-hepcms-0key.pem
- OSG (directory and contents)
- condor-status-script.sh
- In /etc:
- krb5.conf
- krb.conf
- krb.realms
- fstab
- exports
- auto.master
- auto.home
- auto.software
- skel (directory and contents)
- sysconfig/iptables
- In /home/install/site-profiles/4.3/nodes:
- extend-compute.xml
- replace-auto-partition.xml
- replace-auto-kickstart.xml (if it exists)
- replace-condor-client.xml
- interactive.xml
- grid.xml
- In /home/install/site-profiles/4.3/graphs/default:
- interactive.xml
- grid.xml
- /home/phedex/phedex.tgz
- /var/www/html/index.html
- /var/spool/cron (directory and contents)
- /sharesoft/osg/ce/osg/etc/config.ini
- /sharesoft/cmssw/SITECONF (directory and contents)
We have a backup script, /root/backup-script.sh, which is run by cron on a weekly basis. It will copy all the needed files to /root/backup, which should then be manually copied from the cluster to a different machine on a regular basis.
Note: /sharesoft/osg/ce cannot realistically be used to recover from total HN failure because some OSG services are placed outside of /sharesoft/osg/ce. But it's usually safe to recover from a backup of this directory when attempting to perform OSG software upgrades. When performing the backup, be sure to preserve existing permissions (cp -pr /sharesoft/osg/ce <backup dir>).
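For reference, a minimal sketch of what a script like /root/backup-script.sh might contain (this is not our actual script; keep the copied files in sync with the list above):
#!/bin/bash
# Hedged sketch only: copy the critical files listed above into /root/backup
BACKUP=/root/backup
mkdir -p $BACKUP
cp -p ~root/network-ports.txt ~root/configure-external-network.sh ~root/security.txt $BACKUP
cp -p ~root/*cert.pem ~root/*key.pem ~root/condor-status-script.sh $BACKUP
cp -pr ~root/OSG $BACKUP
cp -p /etc/krb5.conf /etc/krb.conf /etc/krb.realms /etc/fstab /etc/exports $BACKUP
cp -p /etc/auto.master /etc/auto.home /etc/auto.software /etc/sysconfig/iptables $BACKUP
cp -pr /etc/skel /var/spool/cron /home/install/site-profiles/4.3/nodes $BACKUP
cp -p /home/phedex/phedex.tgz /var/www/html/index.html /sharesoft/osg/ce/osg/etc/config.ini $BACKUP
cp -pr /sharesoft/cmssw/SITECONF $BACKUP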
Recover from failure
Note: This section is currently inaccurate and under modifications due to our recent change in site configuration.
A HN failure which requires HN reboot is relatively easy to deal with and simply involves the manual starting of a few services. A HN failure which requires reinstall is difficult because the WNs must be reinstalled as well. Instructions are also provided to powerdown the entire cluster and turn the entire cluster back on. This Rocks guide can help to upgrade or reconfigure the HN with minimal impact - you may want to append the files listed here to the FILEs directive in version.mk (files in /home/install/site-profiles are saved automatically).
- Power down and up procedures
- Recover from HN reboot
- Recover from HN reinstall
- Recover from GN reboot
- Recover from GN reinstall
Power down and up procedures
Before powering down, make sure you have a recent backup of the critical files. Our backup script places all the needed critical files in /root/backup on a weekly basis. To power down, login to the HN as root (su -):
- cd /sharesoft/osg/ce
- . setup.sh
- vdt-control --off
- condor_status will show if any jobs are running, if they are, shut down condor without killing jobs following this condor recipe
- ssh-agent $SHELL
- ssh-add
- cluster-fork "poweroff"
- poweroff
If you are concerned about the possibility of power spikes, go to the RDC:
- Flip both power switches on the back of the big disk array.
- Flip the power switch on the KVM (in the back of the rack).
- Turn the UPS off by pressing the O (circle) button.
- Flip the power switch on the back of the UPS.
- Flip the power switches on both large PDUs, in the middle of the rack. Each large PDU has two switches. - Remove the floor tile directly behind the cluster.
- If possible without undue strain to the connectors, unplug both power cables from their sockets.
- Replace the floor tile.
To power up, go to the RDC:
If applicable:
- Remove the floor tile directly behind the cluster.
- Plug in power cables in the floor.
- Replace the floor tile.
- Flip UPS, big PDU, and KVM power switches.
- Turn UPS on by pressing | / Test button on the front.
- Turn the big disk array on by flipping both switches in the back. Flip one switch, wait for the disks and fans to spin up, then spin down. Then flip the second switch.
Once the big disk array fans and disks have spun down from their initial spin up:
- Press power button on HN. Wait for it to boot completely.
- Power cycle the switch using its power cable (the switch has no power switch, hardy har har). - Login to the HN as root and start the GUI environment (startx).
- Open an internet browser and enter the address 10.255.255.254. If you don't get a response, wait a few more minutes for the switch to complete its startup, diagnoses, and configuration.
- Log into the switch (user name and password can be obtained from Marguerite Tonjes).
- Under Switching->Spanning Tree->Global Settings, select Disable from the "Spanning Tree Status" drop down menu. Click "Apply Changes" at the bottom.
- Press the power buttons on all eight WNs. Wait a few seconds between each one.
- Follow the procedure below to recover from HN reboot.
Note: While our cluster has the ability to be powered up completely from a network connection, it has not yet been configured. At the present time, powering up requires a visit to the RDC.
Recover from HN reboot
BeStMan & OSG should be started automatically at boot time. As root (su -) on the HN:
- Check RSV probes.
If any probes are failing, it may be due to cron maintenance jobs for OSG which haven't run yet. Issue the command:
crontab -l | grep osg
and scan for any jobs with names that are similar to the failing probe. Execute the command manually and wait for the next RSV probe to run. - If you rebooted the phedex node (phedex-node-0-7), you must restart the PhEDEx services following these instructions.
- PhEDEx, if still running, will reconnect with the BeStMan service automatically. You can verify that the instances are still running by checking the files on phedex-node-0-7:
/scratch/phedex/current/Debug_T3_US_UMD/logs/download-srm
/scratch/phedex/current/Prod_T3_US_UMD/logs/download-srm
If PhEDEx does not reconnect, follow these instructions to stop and start the PhEDEx services. - Check Ganglia. All nodes should be reporting, it is highly unlikely that HN reboot alone would cause WNs to stop reporting. However, if they are not reporting, try restarting the Ganglia service:
/etc/init.d/gmond restart
/etc/init.d/gmetad restart
If a node is still not reporting, you can attempt to reboot the WN:
ssh-agent $SHELL
ssh-add
ssh compute-x-y 'reboot'
or, to reboot all WNs:
cluster-fork "reboot"
Recover from GN reboot
Recover from HN reinstall
- Install the HN and WNs following the Rocks installation instructions. Using the Rocks boot disk instead of PXE boot for the WNs has a higher probability of success. Forcing the default partitioning scheme for the WNs also has a higher probability of success. Don't forget to power cycle the switch and configure it via a web browser.
- Copy the backed up critical files to /root/backup. Make sure the read/write permissions are set correctly for each file. As root (su -):
cd /root/backup - Create at least one new user.
- Configure security following the instructions in security.txt.
- Copy the info and certificate files to the correct directory:
cp security.txt ../.
cp network-ports.txt ../.
cp configure-external-network.sh ../.
cp hepcms-0cert.pem ../.
cp hepcms-0key.pem ../.
cp http-hepcms-0cert.pem ../.
cp http-hepcms-0key.pem ../. - Follow the instructions in this How-To guide to change the WN partitions (if necessary), mount the big disk, place the WNs on the external network and install xemacs and emacs on both the HN and WNs. Instructions which call for rocks-dist dist (and the accompanying shoot-node) can be stacked. Shoot the nodes (re-install the WNs) once after configuring Rocks for the disks, network and emacs. Then install all the software (the CRAB & PhEDEx nodes must be shot one more time). A few notes:
- Backed up copies of many of the modified files should already be made, so there should be very few manual file edits. Be sure to save the original files in case of failure.
- The boot order of the WNs may have changed, so the Rocks name assignment may correspond to a different physical node. The external IP addresses map to an exact patch panel port number, so move the network cables to the correct port on the patch panel. Use /root/network-ports.txt as your guide and be sure to modify it with the new switch port numbers (or move the switch port cables if you prefer -- the switch doesn't care). You may also want to modify the LEDs displaying the Rocks internal name, which can be done at boot time (strike F2 during boot to get to setup), under "Embedded Server Management."
Recover from GN reinstall
Although a Rocks appliance, the grid node is never intended to be reinstalled via Rocks kickstart. It is installed once from Rocks kickstart and all subsequent installs are done from its command line. If issuing a shoot-node on the grid node is absolutely necessary, the relevant software and hardware which must be reconfigured is:
The big disk array
CMSSW
OSG
PhEDEx
Solutions to encountered errors
Errors are organized by the program which caused them:
- RAID
- Rocks
- Condor
- Logical volume (LVM)
- CMSSW
- gcc/g++/gcc4/g++4
- Dell OpenManage
- YUM
- gLite
- srm
- OSG/RSV
- SiteDB/PhEDEx
RAID
- During HN boot:
Foreign configuration(s) found on adapter.
Followed by:
1 Virtual Drive(s) found
1 Virtual Drive(s) offline
3 Virtual Drive(s) handled by BIOS
This Dell troubleshooting guide is a useful resource. In our case, this occurred because we booted the HN before the disk array had fully powered up. We believe this also corrupted the PERC-6/E RAID controller configuration. We then shut down the HN, let the disk array fully power up, powered the HN on again, and loaded the foreign configuration (pressed the f key). The RAID controller can also be configured again using the configuration utility (c or Ctrl+r).
Rocks
- NameError: global name 'FileCopyException' is not defined (inspection of other terminals shows that comps.xml is missing)
Rocks 4.3 needs a special 'comps' roll for SL4.5. It must be downloaded, placed on disk, and selected as a roll at install time. - An error occurred when attempting to load an installer interface component className=FDiskWindow"
Rocks is complaining that the partition table in the kickstart file is incorrect. Check /home/install/site-profiles/4.3/nodes/replace-auto-partition.xml for syntactic problems (Beware! You may lose existing data!). If your system is having very serious partition issues, or this file does not exist, try these instructions to force the default Rocks partitioning scheme. Once replace-auto-partition.xml is repaired, issue the rocks-dist dist command from the /home/install directory. Depending on your situation, you may need to force the nodes to load the new kickstart file. - After a WN installs successfully, it reboots with the error:
mkrootdev: label /1 not found
Mounting root filesystem
mount: error 2 mounting ext3
mount: error 2 mounting none
Switching to new root
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!
This error can occur when using non-default partitioning on the WNs and is due to disk LABEL synchronization issues. The Rocks authors have seen this error before, but are unable to reproduce the conditions which cause it to occur. In order to prevent failures of this type from occurring both when first attempting to use non-default partitioning and when calling shoot-node after successful reinstall, add the following to /home/install/site-profiles/4.3/nodes/extend-compute.xml in the <post> section:
e2label /dev/sda1 /
cat /etc/fstab | sed -e s_LABEL=/1_LABEL=/_ > /tmp/fstab
cp -f /tmp/fstab /etc/fstab
cat /boot/grub/grub-orig.conf | sed -e s_LABEL=/1_LABEL=/_ > /tmp/grub.conf
cp -f /tmp/grub.conf /boot/grub/grub-orig.conf
chmod +w /boot/grub/grub-orig.conf
unlink /boot/grub/grub.conf
ln -s /boot/grub/grub-orig.conf /boot/grub/grub.conf
This will force all files which use disk LABELs to be in agreement with one another. It also makes grub-orig.conf writeable so Kernel updates can modify the file, and makes the symlink a full path instead of a relative one (relative symlinks also cause problems with Kernel updates). Be sure to create the new rocks distribution:
cd /home/install
rocks-dist dist
And use the Rocks boot disk to get the WN to reinstall - we've found that PxE boot is not consistently successful in the event of a Kernel panic. - shoot-node gives errors:
Waiting for ssh server on [compute-0-1] to start
ssh: connect to host compute-0-1 port 2200: Connection refused
...
Waiting for VNC server on [compute-0-1] to start
Can't connect to VNC server after 2 minutes
ssh: connect to host compute-0-1 port 2200: Connection refused
...
main: unable to connect to host: Connection refused (111)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/scratch/home/build/rocks-release/rocks/src/roll/base/src/foundation-python/foundation-python.buildroot//opt/rocks/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/opt/rocks/sbin/shoot-node", line 313, in run
os.unlink(self.known_hosts)
OSError: [Errno 2] No such file or directory: '/tmp/.known_hosts_compute-0-1'
and examination of WNs reveals they are trying to install interactively (i.e., requesting language for the install, etc.):
This seems to occur most commonly when there is a problem with the .xml files used for the Rocks distribution. The solution which works most consistently is to remove all your modified .xml files in /home/install/site-profiles/4.3/nodes (leave skeleton.xml) and force default partitioning. Then reinstall the WNs -- you will have to manually restart the WNs as they will remain in the interactive install state until manual intervention. Failing this, reinstall the entire cluster, although this will not guarantee success if you use the same .xml files. - shoot-node & cluster-kickstart give the error:
error reading information on service rocks-grub: No such file or directory
cannot reboot: /sbin/chkconfig failed: Illegal seek
This occurs when the rocks-boot-auto package is removed, which prevents WNs from automatically reinstalling every time they experience a hard boot (such as power failure). This error can be safely ignored as it does not actually prevent the node from rebooting and reinstalling from the kickstart when the reinstall commands are manually issued.
- Wordpress gives the error:
We were able to connect to the database server (which means your username and password is okay) but not able to select the wordpress database.
and the MySQL Rocks web interface says in the left-side bar:
No databases
but Rocks commands still work.
We are unsure what caused this error. We attempted various service restarts to no avail. In the end, rebooting the HN solved the issue. We experienced no apparent Rocks DB corruption as a result of this error.
Condor
- Condor job submission works from the HN, but not from any of the WNs. Errors in the Condor logs include "Permission denied" and "Command not found." Examination of /var/opt/condor/log/StarterLog shows the error:
ERROR: the submitting host claims to be in our UidDomain (UMD.EDU), yet its hostname (compute-0-1.local) does not match. If the above hostname is actually an IP address, Condor could not perform a reverse DNS lookup to convert the IP back into a name. To solve this problem, you can either correctly configure DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your condor configuration.
This occurs because jobs are submitted via the local network, so the submitting node has the name compute-x-y.local, instead of HEPCMS-X.UMD.EDU. The easiest fix is to set TRUST_UID_DOMAIN = True in the /opt/condor/etc/condor_config.local files on both the HN and WNs. Instructions are outlined here.
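A minimal sketch of applying the change by hand, assuming the condor_config.local path above and that condor_reconfig is in your PATH; repeat on the HN and every WN:
echo 'TRUST_UID_DOMAIN = True' >> /opt/condor/etc/condor_config.local   # append the setting to the local config
condor_reconfig                                                         # tell the running Condor daemons to re-read their configuration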
LVM
- Insufficient free extents (2323359) in volume group data: 2323360 required (error is received on command lvcreate -L 9293440MB data).
Sometimes it is simpler to enter the value in extents (the smallest logical units LVM uses to manage volume space). Use '-l' instead of '-L' and specify the number of free extents reported in the error:
lvcreate -l 2323359 data
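Alternatively, you can ask LVM how many extents are free and let it allocate all of them. A sketch, assuming an LVM2 version that supports the percentage syntax; the logical volume name data1 is only an illustration:
vgdisplay data | grep 'Free'          # shows the number of free physical extents in VG "data"
lvcreate -l 100%FREE -n data1 data    # allocate every free extent in VG "data"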
CMSSW
- cmsRun works on the HN, but none of the WNs.
Note this error could be caused by any number of issues; when we encountered it, it was because our 'one-time' CMSSW install had the environment variable VO_CMS_SW_DIR set to the real directory where the data resides on the HN, rather than to the network-mounted directory (with a different name) that 'points' to the real directory. For example, the physical partition where CMSSW is installed on the HN is /scratch/cmssw, but the network-mounted directory is named /sharesoft/cmssw. Set VO_CMS_SW_DIR to /sharesoft/cmssw rather than /scratch/cmssw. We found that removing the contents of /scratch/cmssw prior to the complete re-install helped.
- E: Sub-process /sharesoft/cmssw/slc4_ia32_gcc345/external/apt/0.5.15lorg3.2-CMS19c/bin/rpm-wrapper returned an error code (100)
This link suggests that it is due to a lack of disk space in the area where you are installing CMSSW. However, because we install in /sharesoft and /sharesoft is auto-mounted over the network, the size of /sharesoft is not reported until the directory has been explicitly ls'ed or cd'ed (i.e., until the automounter actually mounts it). When RPM checks that there is enough space in /sharesoft to install, the check fails. When executing apt-get, add the option:
apt-get -o RPM::Install-Options::="--ignoresize" ...
- error: unpacking of archive failed on file /share/apps/cmssw/share/scramdbv0: cpio: mkdir failed - Permission denied
This error occurs because both bootstrap.sh and the CMSSW apt-get install create a soft link to the 'root' directory where CMSSW is being installed. In our case, since we first tried to install CMSSW to /share/apps (automatically network mounted by Rocks), the soft link is named share. However, CMSSW also has a true subdirectory named share and writes files to this directory. The soft link overrides the true directory and, as a result, CMSSW tries to install to /share, where it does not have permission. In short, CMSSW cannot be installed to any directory named /share, /common, /bin, /tmp, /var, or /slc4_XXX. Follow the CMSSW installation guide for directions on network mounting /scratch/cmssw as /sharesoft/cmssw.
- apt-get update issues the error:
E: Could not open lock file /var/state/apt/lists/lock - open (13 Permission denied)
E: Unable to lock the list directory
Be sure to first source the scram apt info:
source $VO_CMS_SW_DIR/$SCRAM_ARCH/external/apt/0.5.15lorg3.2-CMS19c/etc/profile.d/init.csh
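If your shell is bash rather than (t)csh, the same directory should also contain a Bourne-shell version of the script; this is an assumption based on the usual CMSSW apt layout:
source $VO_CMS_SW_DIR/$SCRAM_ARCH/external/apt/0.5.15lorg3.2-CMS19c/etc/profile.d/init.sh   # bash/sh equivalent of init.csh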
gcc/g++/gcc4/g++4
- Attempts to compile code give errors about missing header files, including:
stddef.h: No such file or directory
bits/c++locale.h: No such file or directory
bits/c++config.h: No such file or directory
This could be caused by any number of issues. In our case, the gcc-c++ and gcc4-c++ packages needed the libstdc++-devel package. To install it:
rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/libstdc++-devel-3.4.6-8.x86_64.rpm"
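A quick, optional check that the headers are now in place (a sketch; the test file name is arbitrary):
rpm -q libstdc++-devel                        # should now report the installed package
echo 'int main(){return 0;}' > /tmp/t.cc
g++ /tmp/t.cc -o /tmp/t && echo "g++ OK"      # minimal compile test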
Dell OpenManage
- srvadmin-install.sh gives the error:
libstdc++.so.5 is needed by srvadmin-omacore-5.5.0-364.i386
libstdc++.so.5(GLIBCPP_3.2) is needed by srvadmin-omacore-5.5.0-364.i386
libstdc++.so.5(GLIBCPP_3.2.2) is needed by srvadmin-omacore-5.5.0-364.i386
libstdc++.so.5 is needed by srvadmin-rac5-components-5.5.0-364.i386
While we have compat-libstdc++-33-3.2.3-47.3.x86_64 installed, Dell needs the i386 version. Get it with:
wget "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/SL/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm"
rpm -ivh compat-libstdc++-33-3.2.3-47.3.i386.rpm
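To confirm that both the x86_64 and i386 copies are now installed (rpm prints one line per architecture):
rpm -q --queryformat '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' compat-libstdc++-33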
YUM
- Transaction Check Error: file /etc/httpd/modules from install of httpd-2.0.52-38.sl4.2 conflicts with file from package tomcat-connectors-1.2.20-0
We encountered this error on a call to yum update, which was needed for our gLite UI installation. We removed the tomcat-connectors package and encountered no further issues. We also removed the tomcat5 package for good measure, but that may not be necessary.
yum remove tomcat5
yum remove tomcat-connectors
yum clean all
yum update
gLite
- Error: Missing Dependency: perl(SOAP::Lite) is needed by package glite-data-transfer-api-perl
Error: Missing Dependency: perl(SOAP::Lite) is needed by package glite-data-catalog-api-perl
The gLite UI requires the SOAP::Lite Perl module. SOAP::Lite is a difficult install due to the sheer quantity of dependencies on other packages. An excellent dependency resolver is available from RPMforge and makes the SOAP::Lite install a breeze. These instructions are for our particular OS and architecture:
cd /usr/src/redhat/RPMS/x86_64
wget "http://packages.sw.be/rpmforge-release/rpmforge-release-0.3.6-1.el4.rf.x86_64.rpm"
rpm -Uhv rpmforge-release-0.3.6-1.el4.rf.x86_64.rpm
cd ../noarch
wget "http://dag.wieers.com/rpm/packages/perl-SOAP-Lite/perl-SOAP-Lite-0.71-1.el4.rf.noarch.rpm"
yum localinstall perl-SOAP-Lite-0.71-1.el4.rf.noarch.rpm
Note: This error will only occur if you are attempting to do the apt-get style installation of gLite-UI. The tarball installation of gLite-UI is self-contained and you should not encounter this error, nor need RPMforge.
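Once installed, a one-liner can confirm that Perl can load the module (it prints the module version on success):
perl -MSOAP::Lite -e 'print "SOAP::Lite ", $SOAP::Lite::VERSION, "\n"'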
SRM
- srmcp issues the error:
GridftpClient: Was not able to send checksum value: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 500 Invalid command.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500 Invalid command.]
but the file transfer is successful.
This error occurs because srmcp is an SRM client developed by dCache with the added functionality of sending a checksum. BeStMan uses the LBNL SRM implementation and does not support srmcp's checksum functionality, nor does Globus GridFTP. This error can be safely ignored.
OSG/RSV
- After HN boot:
RSV jobs are in the condor production queue instead of the condor development queue.
This bug occurs only when running OSG 0.8.0 with the upgraded RSV V2. For a permanent fix, install RSV V2 as a standalone product rather than as an upgrade to RSV V1, or install OSG 1.0. Alternatively, you can fix this issue every time the HN reboots. As root (su -) on the HN:
- Stop the osg-rsv service:
cd /sharesoft/osg/ce
. setup.sh
vdt-control --off osg-rsv
- Kill the jobs in the condor production queue:
condor_q
condor_rm #, where # is the batch number of any jobs being run by rsvuser
- Restart the osg-rsv service and check the queues:
vdt-control --on osg-rsv
condor_q (to check the production queue)
su - rsvuser
condor_q (to check the development queue)
- If jobs are not in the production queue, but are in the development queue, the procedure has worked. Otherwise, we've found repeating these steps several times seems to work.
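If you would rather not look up batch numbers one by one, condor_rm also accepts a user name; this is a shortcut beyond the procedure above, so double-check with condor_q first:
condor_q -submitter rsvuser   # list only rsvuser's jobs
condor_rm rsvuser             # remove every job owned by rsvuser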
- The RSV cacert-crl-expiry-probe fails with an error to the effect of:
/sharesoft/osg/ce/globus/TRUSTED_CA/1d879c6c.r0 has expired! (nextUpdate=Aug 15 14:28:32 2008 GMT)
and voms-proxy-init fails with the error:
Invalid CRL: The available CRL has expired
This can occur because, for one reason or another, the last cron jobs which should have renewed the certificates did not execute or complete for that particular CA in time. You can manually run the cron jobs by first searching for them in cron:
crontab -l | grep cert
crontab -l | grep crl
then executing them:
/sharesoft/osg/ce/vdt/sbin/vdt-update-certs-wrapper --vdt-install /sharesoft/osg/ce
/sharesoft/osg/ce/fetch-crl/share/doc/fetch-crl-2.6.2/fetch-crl.cron
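When the updates finish, you can confirm that a particular CRL file is now current; a sketch using openssl, substituting the CA hash file named in the error:
openssl crl -in /sharesoft/osg/ce/globus/TRUSTED_CA/1d879c6c.r0 -noout -nextupdate   # nextUpdate should now be in the future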
If fetch-crl.cron prints errors about "download no data from... persistent errors.... could not download any CRL from...", ignore them as long as voms-proxy-init works when fetch-crl.cron completes.
- configure-osg.py -c chokes and vdt-install.log says:
##########
# configure-osg.py invoked at Tue Oct 28 15:36:56 2008
##########
### 2008-10-28 15:36:56,792 configure-osg ERROR In RSV section
### 2008-10-28 15:36:56,792 configure-osg ERROR Invalid domain in gridftp_hosts: UNAVAILABLE
### 2008-10-28 15:36:56,792 configure-osg CRITICAL Invalid attributes found, exiting
########## [configure-osg] completed
This is because gridftp_hosts does not actually accept UNAVAILABLE (despite the comments). Simply set gridftp_hosts=%(localhost)s in config.ini and try running configure-osg.py -c again. RSV will be rolling out a fix for this very soon (as of October 31, 2008).
- MyOSG GIP tests give the error when using gridmapfile in the OSG CE:
GLUE Entity GlueSEAccessProtocolLocalID does not exist
CEMon now gets information for BDII by issuing various srm commands using your http host cert. The distinguished name (DN) of your http host cert needs to be added to your grid-mapfile-local and mapped to a user account.
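A sketch of the mapping entry to add; the DN shown is a placeholder for your http host cert's DN, cmsuser is a placeholder account, and the grid-mapfile-local location is an assumption you should verify against your CE's edg-mkgridmap configuration:
echo '"/DC=org/DC=doegrids/OU=Services/CN=http/ce.example.edu" cmsuser' >> /etc/grid-security/grid-mapfile-local   # placeholder DN and account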
SiteDB/PhEDEx
- After attempting to log in to PhEDEx via certificate, a window pops up several times requesting your grid cert (already imported into your browser); after clicking OK multiple times, you eventually reach a page with the message:
Have You Signed Up?
You need to sign up with CMS Web Services in order to log in and use privileged features. Signing up can be done via SiteDB.
If you have already signed up with SiteDB, it is possible that your certificate or password information is out of date there. In that case go back to SiteDB and update your information.
For your information, the DN your browser presents is:
/DC=something/DC=something/OU=something/CN=Your Name ID#
This problem occurs when your SiteDB/hypernews account is not linked with your grid certificate. Go to the SiteDB::Person Directory (SiteDB only works in the Firefox browser), log in with your hypernews account, and follow the link under the title labeled "Edit your own details here". In the form entry box titled "Distinguished Name", enter the DN info displayed earlier and click the "Edit these details" button. You should then be able to log in to PhEDEx with your grid certificate within 30-60 minutes.
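If you are unsure of the exact DN string to enter, you can read it straight from your user certificate or from an active proxy (a sketch assuming the usual ~/.globus layout):
openssl x509 -in ~/.globus/usercert.pem -noout -subject   # prints the certificate's DN
grid-proxy-info -identity                                 # or read it from a valid proxy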