How To: Guides for users and Maryland T3 admins.

Help: Links and emails for further info.

Configuration: technical layout of the cluster, primarily for admins.

Log: Has been moved to a Google page, accessible only to admins.

Rocks & General Administration - 99% OBSOLETE under SL6!!!

Description General system administration, relevant especially for Rocks. Install Rocks & Scientific Linux 5, modify Rocks Kickstarts and create appliances. Manage users, update RPMs, configure Kerberos, set up garbage collection, manage Condor.
Notes

This guide is designed for our SL5 cluster configuration. We link to the Rocks 5.4 guide installed on our cluster. To access the official Rocks guides, which may be for a different Rocks release, go here.

This guide is roughly 99% out of date as of 2015 and will be removed and replaced shortly. ADMINS of hepcms: please consult our private Google pages for documentation.

Last modified September 11, 2015

We use Rocks as our cluster management software. Rocks supports CentOS, RHEL, and SL. We install Rocks 5.4 with Scientific Linux 5.4, 64 bit. Rocks uses a single frontend node plus appliances. The Rocks frontend serves as the internal network gateway, the gateway to the external network if other nodes have only internal addresses, the software distribution point for bringing up and reinstalling nodes, and the NFS server for users' /home directories. The core Rocks appliance is the compute node, which is primarily a member of the batch pool of whatever batch software you use. Additional appliances can be created, such as interactive nodes or grid services nodes. All appliances are managed by Rocks Kickstart files, which require a reinstall of the affected nodes whenever they are modified. Software is added to Rocks via rolls, or manually by placing the needed files in a contrib directory and modifying the appropriate appliance's Kickstart file.
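A few standard Rocks commands (run as root on the frontend) give a quick view of this layout; the commands below are only an illustration, not a required step:

rocks list appliance          # appliances known to Rocks (compute, grid, etc.)
rocks list host               # every host with its appliance, rack, and rank
rocks list host interface     # network interfaces and IP addresses per host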

Table of Contents - 99% OBSOLETE under SL6!!!

Install Rocks on the HN

Description Install Rocks 5.4 on Scientific Linux 5.4, x86_64 architecture on the cluster head node/frontend.
Dependencies None.
Notes Rocks downloads are available here, SL is available here.
Guides - Rocks 5.4 user's guide
  1. Download the Rocks Kernel/Boot roll (best on a CD for recovery)
  2. Download the Rocks Core roll
  3. Download Scientific Linux 5.4 (all CDs or DVDs)
  4. Burn all the .iso files to disks
  5. Follow the Rocks 5.4 user's guide to install Rocks on the head node. Additions to the guide:
    1. Our network configuration is detailed here. The initial boot phase is on a timer and will terminate if you do not enter the network information quickly enough.
    2. Be sure to add the kernel roll.
    3. We selected the area51, base, ganglia and web-server rolls from the Core CD.
    4. Insert each SL5.4 disk in turn and select the LTS roll listed.
    5. As far as we know, the questions about certificate information on the "Cluster Information" screen are not used by any applications that we install. We entered the following, which may or may not be correct:

      FQHN: hepcms-hn.umd.edu
      Name: UMD HEP CMS T3
      Certificate Organization: DOEgrids
      Certificate Locality: College Park
      Certificate State: Maryland
      Certificate Country: US
      Contact: mtonjes@nospam.umd.edu (w/o the nospam)
      URL: http://hep-t3.physics.umd.edu
      Latitude/Longitude: N38.98 W-76.92

    6. We use manual partitioning and allocate the following partition table (if you wish to preserve existing data, be sure to restore the partition table and do not modify any partitions you wish to keep, other than to specify the mount directory name):

      /dev/sda (sizes in MB):
      /         16384  sda1  ext3
      swap      16384  sda2  swap
      /var       8192  sda3  ext3
      (sda4 is the extended partition, which contains sda5)
      /scratch  28411  sda5  ext3  (fill to max available size)

      /dev/sdb (418168 MB; RAID-5, 408.38 GB, physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5):
      /export  418168  sdb1  ext3  (fill to max available size)

      Rocks will always reformat the root / partition.

    7. On your first login to the HN, you will be prompted to generate RSA keys; do so.
    8. Security needs to be configured promptly. If you are another site following these instructions, you can contact Marguerite Tonjes for our security setup document as a starting point for your local site security (security tends to be site-specific, and we don't claim ours is fool-proof). Your identity will need to be confirmed by Marguerite Tonjes.
    9. Be sure to update RPMs on the appropriate nodes after they are installed (a sketch is given just below).
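For example (a sketch only; adjust the update policy and node list to your site), RPMs can be updated on the HN with yum and pushed to already-installed nodes with rocks run host:

yum -y update
ssh-agent $SHELL
ssh-add
rocks run host compute "yum -y update"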

 

Add network switch to Rocks

Description Give the network switch an internal IP address from the HN via DHCP.
Dependencies - Rocks installed on HN
- Network switch configured to request IP via DHCP
Notes  
Guides - Rocks 5.4 user's guide to install compute nodes
  1. If you have not already done so, be sure to configure the switch via the serial cable to get its IP via DHCP and set a login name and password (for internet management).
  2. Execute insert-ethers and select "Ethernet switches" to get Rocks to recognize the switch and supply it with an internal IP. The switch can take a long time to issue DHCP requests; wait at least 30 minutes.
  3. The switch doesn't actually Kickstart, so quit insert-ethers using the F9 key (not F8) once it has recognized the switch as network-0-0.
  4. Open a web browser on the HN and log into the switch (network-0-0).
  5. We must turn off the switch's Spanning Tree, so follow these instructions to do so.

 

Install Rocks on the WNs

Description Kickstart worker/compute nodes for the first time.
Dependencies - Rocks installed on HN
- Network switch recognized by Rocks
Notes Do not use any of the provided Kickstart files blindly. You will almost certainly have to modify them to suit your needs, especially extend-compute.xml and replace-partition.xml.
Guides - Rocks 5.4 user's guide to install compute nodes
- Add packages to compute nodes
- Custom configuration of compute nodes
- RHEL5 Kickstart configurator : this link moves around a lot - look for RHEL 5 Installation Guide with a section on "Kickstart Configurator"

Description of the sections of our Kickstart file, /export/rocks/install/site-profiles/5.4/nodes/extend-compute.xml, is below. Note that you will almost certainly have to modify this file to suit your use case. Do not use it blindly. The commands are described in greater detail in the linked text below.

In addition to extend-compute.xml, we have several additional Kickstart files which replace the default files in the Rocks distribution. Each one is described in greater detail in the linked text. Do not use any of these blindly.

replace-yum.xml : Keep the SL yum repository files.
replace-partition.xml : Use a different partitioning scheme.
replace-auto-kickstart.xml : Prevent nodes from automatically reinstalling whenever the power cycles.
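For orientation, a Rocks node XML file has roughly the shape sketched below (a hypothetical minimal example, not one of our actual files; the package name is only an illustration): packages go in <package> tags, and shell commands that run at install time go in the <post> section, with special characters escaped as in the snippets later in this guide (&gt;, &amp;).

<?xml version="1.0" standalone="no"?>
<kickstart>
  <description>Extra packages and configuration for compute nodes</description>
  <package>emacs</package>
  <post>
echo "extend-compute post section ran" &gt;&gt; /root/extend-compute.log
  </post>
</kickstart>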

These commands are executed as root (su -) on the Rocks head node.

  1. Navigate to the SL5 LCG (CMSSW) dependency Twiki and download HEP_OSlibs.repo to a location where it can be served from the HN to the WNs during Kickstart:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/HEP/repo/HEP_OSlibs.repo"
  2. Similarly, get krb5.conf from FNAL:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "
    http://security.fnal.gov/krb5.conf"
  3. Get the Condor yum repo as well:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "http://www.cs.wisc.edu/condor/yum/repo.d/condor-stable-rhel5.repo
    "
  4. Follow the instructions in the Condor section to create the file /export/rocks/install/contrib/5.4/x86_64/RPMS/compute_condor_config.local, the Condor configuration file which will be used on compute nodes.
  5. Get the rpm which installs the Hadoop yum repo:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "
    http://vdt.cs.wisc.edu/hadoop/osg-hadoop-1-2.el5.noarch.rpm"
  6. Follow the instructions in the Hadoop guide to create the file /export/rocks/install/contrib/5.4/x86_64/RPMS/hadoop-config, the Hadoop configuration file used on all nodes. The linked file has dummy entries or values which won't apply to your site. Be sure to edit it.
  7. Follow the instructions in the Hadoop guide to get the correct version of the FUSE Kernel module from ATrpms. Place it in the contrib directory, e.g.:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "http://dl.atrpms.net/all/fuse-kmdl-2.6.18-238.19.1.el5-2.7.4-8_12.el5.x86_64.rpm"
  8. Follow the instructions in the Hadoop guide to get the most recent version of FUSE from ATrpms. Place it in the contrib directory, e.g.:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "http://dl.atrpms.net/all/fuse-2.7.4-8_12.el5.x86_64.rpm"
  9. After editing each file to suit your needs, place extend-compute.xml, replace-yum.xml, replace-partition.xml, and replace-auto-kickstart.xml in /export/rocks/install/site-profiles/5.4/nodes.
  10. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  11. Verify that the new XML code is correct:
    rocks list appliance xml compute
    If this throws an exception, the last line states where the syntax problem is.
  12. Follow the Rocks guide for installing your compute nodes. Specifically, install the worker nodes by calling insert-ethers, selecting Compute, powering up the new nodes and selecting PXE boot on the new nodes as they boot by striking the F12 key on each. Alternatively, insert the Rocks Kernel/Boot CD into each WN shortly after pressing the power button.

 

Create the grid appliance

Description Create a Rocks appliance which serves as the basis for our grid node.
Dependencies - Rocks installed on HN
- WN Rocks Kickstart configured
- Network switch recognized by Rocks
Notes

Our grid node Kickstart file is basic because OSG cannot be preserved via tarball. The grid appliance is not intended for subsequent reinstall. The grid appliance 'inherits' from the compute appliance, so it will get all modifications made to the WNs.

Guides - Rocks 5.4 user's guide to create appliances
- Add packages
- Custom configuration

Description of the sections of the Kickstart file, grid.xml, is below. Note that you will almost certainly have to modify it to suit your use case. Do not use it blindly.

These commands are executed as root (su -) on the Rocks head node.

  1. If you haven't already done so for the interactive appliance, follow the instructions in the Condor section to create the file /export/rocks/install/contrib/5.4/x86_64/RPMS/interactive_condor_config.local, the Condor configuration file which will be used on interactive nodes (and the grid node).
  2. Place the files grid.xml in /export/rocks/install/site-profiles/5.4/nodes and grid-appliance.xml in /export/rocks/install/site-profiles/5.4/graphs/default.
  3. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  4. Add an entry for the new grid appliance to the Rocks MySQL database:
    rocks add appliance grid membership='Grid Management Node' short-name='gr' node='grid'
  5. Verify that the new XML code is correct:
    rocks list appliance xml grid
    If this throws an exception, the last line states where the syntax problem is.
  6. Now install the grid node by calling insert-ethers, selecting Grid Management Node, powering up the new node and selecting PXE boot on the new node as it boots.
  7. Configure its external network interface.

 

Create the interactive appliance

Description Create a Rocks appliance which services interactive users.
Dependencies - Rocks installed on HN
- WN Rocks Kickstart configured
- Network switch recognized by Rocks
Notes Since CRAB & gLite-UI can be installed via tarballs, interactive nodes can be reinstalled via Rocks Kickstart. The interactive appliance 'inherits' from the compute appliance, so it will get all modifications made to the WNs.
Guides - Rocks 5.4 user's guide to create appliances
- Add packages
- Custom configuration

Description of the sections of the Kickstart file, interactive.xml, is below. Note that you will almost certainly have to modify it to suit your use case. Do not use it blindly.

These commands are executed as root (su -) on the Rocks head node:

  1. Navigate to the gLite-UI tarball repository and select your desired version of gLite-UI. These instructions are for 3.2.8-0, though they can be adapted for later releases. Download the lcg-CA yum repo file and gLite-UI tarballs where they can be served from the HN:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo"
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL5_x86_64/glite-UI-3.2.8-0.sl5.tar.gz"
    wget "http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-UI/SL5_x86_64/glite-UI-3.2.8-0.sl5-external.tar.gz"
  2. Navigate to the CRAB download page and select your desired version of CRAB. These instructions are for 2_7_7, though they can be adapted for later releases. Download the tarball where it can be served from the HN:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "http://cmsdoc.cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_7_7.tgz"
  3. If you haven't already done so for the grid appliance, follow the instructions in the Condor section to create the file /export/rocks/install/contrib/5.4/x86_64/RPMS/interactive_condor_config.local, the Condor configuration file which will be used on interactive nodes (and the grid node).
  4. Place the file interactive.xml in /export/rocks/install/site-profiles/5.4/nodes and interactive-appliance.xml in /export/rocks/install/site-profiles/5.4/graphs/default.
  5. Edit interactive.xml to change settings in the created site-info.def file as appropriate for your site, discussed in the CRAB section.
  6. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  7. Add an entry for the new interactive appliance to the Rocks MySQL database:
    rocks add appliance interactive membership='Interactive Node' short-name='in' node='interactive'
  8. Verify that the new XML code is correct:
    rocks list appliance xml interactive
    If this throws an exception, there is a problem with your xml syntax. Usually the last line in the exception states where the syntax problem is.
  9. Now install the interactive nodes by calling insert-ethers, selecting Interactive Node, powering up the new node and selecting PXE boot (F12) on the new node as it boots.

 

Create the Hadoop namenode appliance (SE)

Description Create a Rocks appliance which serves as the basis for our Hadoop namenode.
Dependencies - Rocks installed on HN
- WN Rocks Kickstart configured
- Network switch recognized by Rocks
Notes

We called the Hadoop appliance "SE" in our Rocks commands. This is really a misnomer. The OSG SE, BeStMan-Gateway, runs on the grid node. However, the SE does contact the Hadoop namenode that runs on this appliance. The Hadoop appliance 'inherits' from the compute appliance, so it will get all modifications made to the WNs, including the directives to install and configure the base Hadoop software.

Guides - Rocks 5.4 user's guide to create appliances
- Add packages
- Custom configuration
- OSG Hadoop planning document
- OSG Hadoop installation

Description of the sections of the Kickstart file, SE.xml, is below. Note that you will almost certainly have to modify it to suit your use case. Do not use it blindly.

These commands are executed as root (su -) on the Rocks head node.

  1. If you haven't already done so for the grid appliance, follow the instructions in the Condor section to create the file /export/rocks/install/contrib/5.4/x86_64/RPMS/interactive_condor_config.local, the Condor configuration file which will be used on the Hadoop namenode (and the grid and interactive nodes).
  2. Place the files SE.xml in /export/rocks/install/site-profiles/5.4/nodes and SE-appliance.xml in /export/rocks/install/site-profiles/5.4/graphs/default.
  3. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  4. Add an entry for the new appliance to the Rocks MySQL database:
    rocks add appliance SE membership='Hadoop Namenode' short-name='se' node='SE'
  5. Verify that the new XML code is correct:
    rocks list appliance xml SE
    If this throws an exception, the last line states where the syntax problem is.
  6. Now install the Hadoop namenode by calling insert-ethers, selecting "Hadoop Namenode", powering up the new node and selecting PXE boot on the new node as it boots.

 

Re-install nodes

Prevent automatic re-install

Description Prevent Rocks from reinstalling nodes when they power cycle.
Dependencies - Rocks installed on HN
Notes Follow the official Rocks FAQ. After removing the automatic reinstall feature, calls to shoot-node and cluster-kickstart will print the following errors, which can be safely ignored:
cannot remove lock: unlink failed: No such file or directory
error reading information on service rocks-grub: No such file or directory
Guides - Rocks FAQ: disabling automatic reinstall

Kickstart nodes

Description Install or reinstall nodes by Kickstarting them.
Dependencies - Rocks installed on HN
Notes rocks run host can be used to propagate changes to nodes without Kickstarting them. Use an ssh-agent before calling shoot-node or rocks run host.
Guides - RHEL5 Kickstart configurator : this link moves around a lot - look for RHEL 5 Installation Guide with a section on "Kickstart Configurator"

Some modifications in Rocks will require nodes be reinstalled. This tends to be true in cases which require you to issue the command 'rocks create distro,' typically because you edited a Kickstart xml file. In most cases, this involves simply issuing:

ssh-agent $SHELL
ssh-add
shoot-node compute-0-0
(repeat for all desired nodes)

Be sure to consult the recovery guide to handle any issues that cannot be handled via Kickstart.

Since Rocks requires a reinstall of nodes every time a change is made to their kickstart files, you may want to wait until a scheduled maintenance time to reinstall. The "rocks run host" command is useful to get the desired functionality prior to reinstall:

ssh-agent $SHELL
ssh-add
rocks run host "command"

"command" can be anything you'd like run on each WN individually, which could include a network-mounted shell script. rocks run host can also execute on other nodes by specifying the appliance(s) it should run on, e.g., to run on all but the HN:

rocks run host compute interactive grid "command"

rocks run host can timeout, so it's not recommended for executing long-duration commands. In such a case, ssh is best:

ssh-agent $SHELL
ssh-add
ssh compute-0-0 "command"
(repeat for all desired nodes)

 

Modify partitions

Description Change the partition tables on nodes installed via Rocks Kickstart.
Dependencies - Rocks installed on HN
Notes Follow the official Rocks partitioning guide. Rocks will not remove an existing partition unless the nukeit.sh script is called on the node, though it will reformat existing partitions unless otherwise specified. Repartitioning requires a node reinstall (Kickstart).
Guides - Rocks partitioning guide
- RHEL5 Kickstart partitioning syntax : this link moves around a lot - look for RHEL 5 Installation Guide with a section on "Kickstarting"
- RHEL5 Kickstart configurator : this link moves around a lot - look for RHEL 5 Installation Guide with a section on "Kickstart Configurator"
- Forcing the default partitioning scheme in Rocks

Our partition structure is given in our configuration.

See our replace-partition.xml and grid.xml Kickstart files for an example of preserving existing partitions and existing data. See our interactive.xml Kickstart file for an example of creating a fresh partition structure every time the node is reinstalled.

Although the grid and interactive appliances inherit from the compute appliance, the partitioning scheme specified in the grid and interactive Kickstart files will overwrite the scheme specified in replace-partition.xml for those appliances.

Although the Rocks partition guide suggests you can use the syntax "--ondisk sda", the RHEL "Kickstart Configurator" GUI uses the syntax "--ondisk=sda". We had Kickstart errors (supposedly there wasn't enough room for the "/" partition) when we used --ondisk without an equals sign. Be sure to use "=" for every partition option which takes an argument.
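For example, the HN partition table from the install section would be written with Kickstart directives along these lines (an illustration of the syntax only, not our actual replace-partition.xml; note the "=" on every option):

part /        --size=16384 --ondisk=sda --fstype=ext3
part swap     --size=16384 --ondisk=sda
part /var     --size=8192  --ondisk=sda --fstype=ext3
part /scratch --size=1 --grow --ondisk=sda --fstype=ext3
part /export  --size=1 --grow --ondisk=sdb --fstype=ext3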

In the event of improper partition syntax, the node will typically go into a Kernel panic when the node attempts to boot after Kickstart. In such a case, it is best to force the default partitioning scheme on the node, install, then try again with the preferred partitioning scheme. You will lose all data on the nodes for which you force the default scheme. Be sure to restore your desired partitioning scheme and create a new Rocks distribution after you've recovered the faulty node.

 

Configure external network

Description Configure the second ethernet port to be a public interface on applicable nodes.
Dependencies - Rocks installed on HN
- The applicable nodes (e.g., grid, interactive) have been Kickstarted
Notes Rocks will not set the public gateway correctly; it must be set manually after every Kickstart. Nodes do not have to be Kickstarted to reconfigure their network settings.
Guides - Rocks network guide
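The rough sequence is sketched below for a hypothetical host grid-0-0 with public interface eth1 (the placeholder values must be replaced with your own, and the exact command set should be verified against the Rocks network guide linked above). Remember that the public gateway still has to be set by hand on the node afterwards, since Rocks does not set it correctly:

rocks set host interface ip grid-0-0 eth1 YourPublicIP
rocks set host interface subnet grid-0-0 eth1 public
rocks set host interface name grid-0-0 eth1 YourPublicHostname
rocks sync config
rocks sync host network grid-0-0
ssh grid-0-0 "route add default gw YourPublicGateway"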

 

 

Enable public_html

Description Allow users to place files in their public_html directory to be browsable at http://hepcms-hn.umd.edu/~username.
Dependencies - Rocks installed on HN
Notes Be sure to read the contents of httpd.conf in the UserDir section, as it outlines the security choice being made by enabling this functionality. Additionally, users must modify the permission of their home directory to 711, which is potentially exploitable.
Guides  

As root (su -) on the HN, edit /etc/httpd/conf/httpd.conf and comment out the line UserDir disable, then uncomment the line UserDir public_html. It's also strongly recommended to uncomment the lines in the immediately following <Directory /home/*/public_html> section. Restart the httpd service:

/etc/init.d/httpd restart
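After the edit, the UserDir lines in httpd.conf read roughly as follows (the surrounding stock configuration may differ between httpd releases):

#UserDir disable
UserDir public_html

Each user then needs a world-executable home directory and a public_html directory (the 755 below is a typical choice, not something mandated by this guide):

chmod 711 $HOME
mkdir -p ~/public_html
chmod 755 ~/public_html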

Users can follow this guide to use this functionality.

Change the default umask

Description Change the umask for normal users so that the files and directories they create are fully accessible to the owner, readable by the group, and inaccessible to everyone else.
Dependencies - Rocks installed on HN
Notes This is not mandatory. However, if you have users who aren't in CMS, you may wish to prevent them from being able to read files made by users in CMS. We remove non-CMS users from the "users" group (putting them in their own group), then prevent files from being world readable by default.
Guides  

As root (su -) on the HN, edit /etc/bashrc and replace the lines:

else
    umask 022

with:

elif [ "root" = "`id -un`" ]; then
    umask 022
else
    umask 027

and edit /etc/csh.cshrc and replace the lines:

else
umask 022

with:

else if ("root" == "`id -un`") then
umask 022
else
umask 027

We also edit these files on all other nodes as a part of the Kickstart using extend-compute.xml.
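A quick sanity check as a normal (non-root) user in a new shell: with umask 027, new files come out mode 640 and new directories mode 750.

umask                      # should print 0027
touch testfile
mkdir testdir
ls -ld testfile testdir    # expect -rw-r----- for testfile and drwxr-x--- for testdir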


 

Kerberos:

Description Configure Kerberos for getting tickets from FNAL & CERN.
Dependencies - Rocks installed on HN
Notes User instructions for kerberos authentication are given here.
Guides  

Configure Kerberos on the HN. As root (su -) on the HN:

  1. To enable FNAL tickets, save this file as /etc/krb5.conf.
  2. Configure ssh to use Kerberos tickets:
    Add the following to /etc/ssh/ssh_config inside the Host * section:
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes

    Restart the ssh service:
    /etc/init.d/sshd restart
  3. We add aliases to the skeleton .cshrc & .bashrc files so all new users have the commands appropriate for FNAL & CERN:
    Add to /etc/skel/.cshrc:
    # Kerberos
    alias kinit_fnal '/usr/kerberos/bin/kinit -A -f'
    alias kinit_cern '/usr/kerberos/bin/kinit -5 -A'
    Add to /etc/skel/.bashrc and to ~root/.bashrc:
    # Kerberos
    alias kinit_fnal='/usr/kerberos/bin/kinit -A -f'
    alias kinit_cern='/usr/kerberos/bin/kinit -5 -A'

Configure Kerberos on all other nodes. As root (su -) on the HN:

  1. Copy krb5.conf to where it can be served from the HN during the Kickstart of other nodes:
    cp /etc/krb5.conf /export/rocks/install/contrib/5.4/x86_64/RPMS/krb5.conf
    Make sure the file is world readable:
    chmod 644 /export/rocks/install/contrib/5.4/x86_64/RPMS/krb5.conf
  2. Edit /export/rocks/install/site-profiles/5.4/nodes/extend-compute.xml and add to the <post> section:
    wget -P /etc http://<var name="Kickstart_PublicHostname"/>/install/rocks-dist/x86_64/RedHat/RPMS/krb5.conf
    <file name="/etc/ssh/ssh_config" mode="append">
            GSSAPIAuthentication    yes
            GSSAPIDelegateCredentials yes
    </file>
    
  3. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  4. Reinstall all nodes on which you want these changes

 

Garbage collection:

Description Configure cron to garbage collect /tmp.
Dependencies - Rocks installed on HN
Notes  
Guides  

First, create a cron job on the HN. As root (su -) on the HN, edit /var/spool/cron/root and add the lines:
6 * * * * find /tmp -mtime +1 -type f -exec rm -f {} \;
36 2 * * 6 find /tmp -depth -mtime +7 -type d -exec rmdir --ignore-fail-on-non-empty {} \;

This will remove day-old files in /tmp on the HN every hour on the 6th minute and week-old empty directories in /tmp every Saturday at 2:36.

Now create the cron job on all other nodes:

  1. Edit /export/rocks/install/site-profiles/5.4/nodes/extend-compute.xml and place the following commands inside the <post></post> brackets:
    <!-- Create a cron job that garbage-collects /tmp -->
    <file name="/var/spool/cron/root" mode="append">
    6 2 * * * find /tmp -mtime +1 -type f -exec rm -f {} \;
    36 2 * * 6 find /tmp -depth -mtime +7 -type d -exec rmdir --ignore-fail-on-non-empty {} \;
    </file>
  2. Create the new distribution:
    cd /export/rocks/install
    rocks create distro
  3. Reinstall all nodes on which you want these changes

 

Condor

Description Install, configure & monitor condor on all nodes
Dependencies - Rocks installed on HN
Notes There are many mechanisms for installing Condor; we use yum. We install Condor via Rocks Kickstart, using a different Condor configuration file for each appliance. Condor keeps its logs in /var/log/condor. Generally, the most information can be found on the HN or on the node which serviced (not submitted) the job you are investigating. Note that we are switching to installing Condor as part of the OSG Release3 stack, so the configuration instructions may not be up to date and the installation instructions are out of date.
Guides - OSG Tier-3 Condor installation guide
- Condor 7.4 administrator's manual: configuration
- Condor yum repository
- OSG Release3 Condor installation instructions

Each step below details how to install, configure, and start Condor on both the head node and the nodes which are installed via Rocks Kickstart. Once the Rocks Kickstart files have been edited to install, configure, and start Condor, the new Rocks distribution can be created and the nodes reinstalled.

Note that the grid node is not intended for multiple reinstalls, so the instructions given for the grid appliance apply to the first install of the GN via Kickstart. Otherwise, Condor must be installed on the GN manually, similar to what is done for the HN, but with a modified configuration file.

Install Condor

As root (su -) on the HN, install Condor on the HN:

  1. Navigate to the Condor yum repository page and put the .repo file for Condor into the yum repo directory on the HN. Specifically:
    cd /etc/yum.repos.d
    wget "http://www.cs.wisc.edu/condor/yum/repo.d/condor-stable-rhel5.repo
    "
  2. Now install Condor on the HN:
    yum install condor.x86_64

As root (su -) on the HN, configure Rocks to install Condor on the remaining nodes:

  1. The .repo file will also need to be served to all the other nodes via Kickstart, so copy it to the Rocks contrib directory and make it world readable:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
    wget "wget http://www.cs.wisc.edu/condor/yum/repo.d/condor-stable-rhel5.repo"

    chmod 644 condor-stable-rhel5.repo
  2. Edit /export/rocks/install/site-profiles/5.4/nodes/extend-compute.xml, which all other appliances inherit from, and add the lines to the <post> section:
    cd /etc/yum.repos.d
    wget http://&Kickstart_PublicHostname;/install/rocks-dist/x86_64/RedHat/RPMS/condor-stable-rhel5.repo &gt;&gt; /root/yum-install.log 2&gt;&amp;1
    yum -y install condor.x86_64 &gt;&gt; /root/yum-install.log 2&gt;&amp;1

Configure Condor

As root (su -) on the HN, edit the condor configuration file /etc/condor/condor_config.local following the Condor configuration manual or use our configuration files as an example. If using our configuration files, be sure to replace all the references to hepcms-hn with the name of your own HN.

We use three configuration files for Condor which are nearly identical, with the most significant difference being the setting DAEMON_LIST, which controls which services start on which nodes (thereby indicating what roles they take).
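For illustration only (these are not our exact files; see the linked configuration files for the real settings), the role differences amount to lines such as:

# HN (central manager):
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
# Compute nodes (execute only):
DAEMON_LIST = MASTER, STARTD
# Interactive and grid nodes (submit and execute):
DAEMON_LIST = MASTER, SCHEDD, STARTD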

We handle three issues in the configuration files: (1) CMSSW jobs cannot be evicted and resumed without loss of compute cycles (they don't support checkpointing), (2) Gratia from OSG wants job reporting sent to a particular directory, and (3) CRAB needs settings for job submission when using Condor as the submission agent.

  1. To prevent CMSSW jobs from being evicted, the following lines are present in the Condor configuration files:
    PREEMPTION_REQUIREMENTS = False
    NEGOTIATOR_CONSIDER_PREEMPTION = False
    CLAIM_WORKLIFE = 300
    WANT_SUSPEND = True
    SUSPEND = ( (CpuBusyTime > 2 * $(MINUTE)) \
    && $(ActivationTimer) > 300 )
    CONTINUE = $(CPUIdle) && ($(ActivityTimer) > 10)
    PREEMPT = False

  2. To tell Gratia where to send job reports:
    PER_JOB_HISTORY_DIR = /sharesoft/osg/ce/gratia/var/data
  3. On nodes which submit CRAB jobs (the interactive nodes), additional settings are needed:
    GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 20
    ENABLE_GRID_MONITOR = TRUE
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000
  4. For OSG jobs, the following is required:
    DELEGATE_JOB_GSI_CREDENTIALS = False

If using our configuration files, copy them to the appropriate directories. As root (su -) on the HN:

  1. After editing the HN configuration file, HN_condor_config.local, for your own HN, copy it to /etc/condor/condor_config.local.
  2. After editing the compute and interactive configuration files, compute_condor_config.local and interactive_condor_config.local, for your own HN, copy them both to /export/rocks/install/contrib/5.4/x86_64/RPMS, where they can be served to the nodes via Rocks Kickstart. Make sure they are world readable.

Now edit the Rocks Kickstart files to get the appropriate configuration file and to append the node specific NETWORK_INTERFACE at the end of each one. As root (su -) on the HN:

  1. Edit /export/rocks/install/site-profiles/5.4/nodes/extend-compute.xml and add the lines to the <post> section after the Condor installation section:
    cd /etc/condor
    mv condor_config.local condor_config.local.original &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    wget http://&Kickstart_PublicHostname;/install/rocks-dist/x86_64/RedHat/RPMS/compute_condor_config.local -O condor_config.local &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    ipaddr=`ifconfig eth0 | grep 'inet addr' | awk '{print $2}' | cut -d : -f 2`
    echo NETWORK_INTERFACE = $ipaddr &gt;&gt; condor_config.local
    cd -
  2. The interactive nodes need a different configuration file than the compute nodes. Edit /export/rocks/install/site-profiles/5.4/nodes/interactive.xml and add the lines to the <post> section:
    cd /etc/condor
    mv condor_config.local condor_config.local.original &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    wget http://&Kickstart_PublicHostname;/install/rocks-dist/x86_64/RedHat/RPMS/interactive_condor_config.local -O condor_config.local &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    ipaddr=`ifconfig eth0 | grep 'inet addr' | awk '{print $2}' | cut -d : -f 2`
    echo NETWORK_INTERFACE = $ipaddr &gt;&gt; condor_config.local
    cd -
  3. Since the grid appliance doesn't inherit from the interactive appliance, these lines need to be repeated in the grid appliance Kickstart file. Edit /export/rocks/install/site-profiles/5.4/nodes/grid.xml and add the lines to the <post> section:
    cd /etc/condor
    mv condor_config.local condor_config.local.original &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    wget http://&Kickstart_PublicHostname;/install/rocks-dist/x86_64/RedHat/RPMS/interactive_condor_config.local -O condor_config.local &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    ipaddr=`ifconfig eth0 | grep 'inet addr' | awk '{print $2}' | cut -d : -f 2`
    echo NETWORK_INTERFACE = $ipaddr &gt;&gt; condor_config.local
    cd -

Start and monitor condor:

As root (su -) on the HN, start Condor:

  1. We want Condor to start whenever the node reboots:
    /sbin/chkconfig --add condor
    chkconfig condor on
  2. Start Condor, check that it's running using ps, and compare the output of ps to the text on the Condor yum repository page:
    service condor start
    ps -ef | grep condor

As root (su -) on the HN, configure Rocks to start Condor on all other nodes:

  1. Edit /export/rocks/install/site-profiles/5.4/nodes/extend-compute.xml, which all other appliances inherit from, and add the lines to the <post> section after the Condor configuration section:
    /sbin/chkconfig --add condor &gt;&gt; /root/condor-install.log 2&gt;&amp;1
    chkconfig condor on &gt;&gt; /root/condor-install.log 2&gt;&amp;1

We create a simple condor monitoring script that will route output to the web server, to be viewed by users. As root (su -) on the HN:

  1. Create the file /root/condor-status-script.sh with the contents:
    #!/bin/bash
    . /root/.bashrc
    OUTPUT=/var/www/html/condor_status.txt
    echo -e " \n\n" >$OUTPUT
    echo -e "As of `date` \n">>$OUTPUT
    /usr/bin/condor_status -submitters >>$OUTPUT
    /usr/bin/condor_userprio -all >>$OUTPUT
    /usr/bin/condor_status -run >>$OUTPUT
  2. Run it every 10 minutes by editing /var/spool/cron/root and adding the line:
    1,11,21,31,41,51 * * * * /root/condor-status-script.sh
  3. Output will be here.

Kickstart nodes:

Now that all the Rocks Kickstart files have been edited to install, configure, and start Condor, create the new Rocks distribution and Kickstart all the non-HN nodes. As root (su -) on the HN:

  1. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  2. Reinstall the other nodes.

 

Hadoop

Description Install HDFS to federate multiple disks on multiple nodes into a single volume. This federated volume will serve as our OSG storage element (SE). These instructions are for OSG Hadoop 0.2.
Dependencies - Hadoop namenode appliance Kickstarted
Notes Be sure to read the OSG Hadoop planning document to obtain a clear understanding of a Hadoop namenode, secondary namenode, and datanode. Decide which hosts will run which types of Hadoop nodes. We choose to run our Hadoop namenode on the SE appliance, our secondary namenode on one of our INs, and our datanodes on our WNs.
Guides - OSG Hadoop planning document
- OSG Hadoop installation
- OSG Hadoop validation (new, as of August 2011, Hadoop guide)
- ATrpms fuse repository
- Primary SL5.4 x86_64 RPM server

Install Hadoop:

Hadoop rpms should be installed on every node which will be in the Hadoop pool, managing the Hadoop pool, or contacting the Hadoop pool. All our WNs are in our Hadoop pool. Our Hadoop namenode is our SE appliance and the Hadoop secondary namenode is one of our INs. All nodes which need to read files in the Hadoop pool also need the Hadoop rpms, especially our GN and INs.

We do not install Hadoop on the HN (the Rocks frontend) because it doesn't need to talk to the Hadoop volume. Additionally, it may not be advisable to install the FUSE Kernel rpms on a Rocks frontend. Therefore, if you opt to install the Hadoop rpms on the HN, be cautious about installing FUSE on the HN.

We install the osg-hadoop, hadoop, and hadoop-fuse packages during Rocks Kickstart via commands in extend-compute.xml by following the OSG Hadoop installation instructions.

If you wish to install Hadoop manually:

  1. As root on the node where you want to install:
    rpm -Uvh http://vdt.cs.wisc.edu/hadoop/osg-hadoop-20-3.el5.noarch.rpm
    yum install hadoop-0.20-osg

The base Hadoop packages can be installed via Rocks Kickstart by doing the following as root on the HN:

  1. Download the file http://vdt.cs.wisc.edu/hadoop/osg-hadoop-1-2.el5.noarch.rpm to /export/rocks/install/contrib/5.4/x86_64/RPMS.
  2. Add the line <package>osg-hadoop</package> to extend-compute.xml.
  3. Edit extend-compute.xml to call the yum install commands inside the <post>...</post> section:
    yum install hadoop-0.20-osg

Configure Hadoop:

We configure Hadoop during Kickstart using commands in extend-compute.xml by following the OSG Hadoop installation instructions. Here is our generic /etc/sysconfig/hadoop file: hadoop-config (this file has blank or dummy values which you will need to change). We configure Hadoop to run the primary namenode on our SE node (SE-0-1) and the secondary namenode on one of our INs (interactive-0-0).

Note that HADOOP_DATA only needs to be set on your Hadoop datanodes. For the rest, leave HADOOP_DATA blank. Alternatively, HADOOP_DATA can be set on other nodes (and won't be used), but HADOOP_DATA must point to a location that exists on the node.
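For example, the relevant line of /etc/sysconfig/hadoop would look like the following (illustrative values, matching the sed command used in the Kickstart steps below):

# On datanodes (our WNs):
HADOOP_DATA=/hadoop1/data,/hadoop2/data
# On the namenode, secondary namenode, GN, and INs:
HADOOP_DATA=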

These instructions must be followed for both manual installations and Kickstart installations of Hadoop:

  1. We run Hadoop as a user other than root. Create this user following these instructions.
  2. We use /share/apps/hadoop/checkpoint as our NFS-mounted checkpoint directory. /share/apps is only writeable by root on the HN, so the hadoop subdirectory needs to be created on the HN and be owned by your newly made hadoop user. On the HN as root:
    mkdir /share/apps/hadoop
    chown yourhadoopuser:users /share/apps/hadoop

If you wish to configure Hadoop manually:

  1. Once you've edited the Hadoop configuration file, call the configuration script:
    service hadoop-firstboot start
  2. Hadoop can create very large log files (up to 10GB) and doesn't pick up the environment setting HADOOP_LOG from the configuration file. We modify the file /etc/hadoop/conf/hadoop-env.sh and set the variable HADOOP_LOG_DIR with the desired directory in extend-compute.xml, but this can easily be done by hand. Calling hadoop-firstboot again will not overwrite this file, but software updates to Hadoop may overwrite it.

Hadoop can be configured via Rocks Kickstart by doing the following as root on the HN:

  1. Create the file /export/rocks/install/contrib/5.4/x86_64/RPMS/hadoop-config (be sure to edit the linked file as it has dummy entries). Make sure the file is world readable:
    chmod 644 /export/rocks/install/contrib/5.4/x86_64/RPMS/hadoop-config
  2. Edit the <post> section of extend-compute.xml to get the base configuration file:
    mv /etc/sysconfig/hadoop /root/hadoop-config.original
    wget http://&Kickstart_PublicHostname;/install/rocks-dist/x86_64/RedHat/RPMS/hadoop-config -O /etc/sysconfig/hadoop
  3. Edit extend-compute.xml to call the configuration script:
    /etc/init.d/hadoop-firstboot start
  4. HADOOP_DATA doesn't need to be set on anything other than Hadoop datanodes. Since the /hadoop1 and /hadoop2 partitions don't even exist on nodes other than our datanodes, the value for HADOOP_DATA must change for these nodes. So we leave it empty. Edit SE.xml, grid.xml, and interactive.xml and add:
    cp /etc/sysconfig/hadoop /root/hadoop-config.forDNs
    cat /etc/sysconfig/hadoop | sed -e s~"HADOOP_DATA=/hadoop1/data,/hadoop2/data"~"HADOOP_DATA="~ &gt; /tmp/hadoop-config
    cp -f /tmp/hadoop-config /etc/sysconfig/hadoop
    /etc/init.d/hadoop-firstboot start
  5. Edit extend-compute.xml to modify the file /etc/hadoop/conf/hadoop-env.sh:
    cp /etc/hadoop/conf/hadoop-env.sh /root/hadoop-env.sh.original
    cat /etc/hadoop/conf/hadoop-env.sh | sed -e s~"#HADOOP_LOG_DIR=/var/log/hadoop"~"HADOOP_LOG_DIR=/scratch/hadoop/log"~ &gt; /tmp/hadoop-env.sh
    cp -f /tmp/hadoop-env.sh /etc/hadoop/conf/hadoop-env.sh

Start Hadoop:

The Hadoop service only needs to run on the primary namenode (our SE appliance), secondary namenode (one of our INs), and datanodes (our WNs). It does not need to run on any other nodes. However, the Hadoop software must be installed on every node that needs to access the Hadoop volume.

We start Hadoop on node boot via Kickstart commands in extend-compute.xml. The new OSG Hadoop guide has lots of details on how to validate that Hadoop is operating correctly (WARNING: this link will probably move in the near future, you may need to find it on your own). Note that the web interface is only accessible via your internal network, so you will need to open a web browser from a node in your cluster.
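For a quick manual check that the filesystem is up (a sketch, assuming the hadoop user created in the configuration step and a running namenode; the OSG validation guide linked above is more thorough):

su - yourhadoopuser
hadoop fs -mkdir /test
hadoop fs -put /etc/hosts /test/hosts
hadoop fs -ls /test
hadoop fs -rm /test/hosts
hadoop fs -rmr /test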

If you wish to start Hadoop manually, as root on the node where Hadoop has been installed and configured:

  1. Start manually:
    service hadoop start
  2. The Hadoop service can be told to start during node boot:
    chkconfig hadoop on

The Hadoop service can be automatically started via Rocks Kickstart by doing the following as root on the HN:

  1. Turning the Hadoop service on when the node boots in extend-compute.xml:
    chkconfig hadoop on
  2. Some nodes don't need the Hadoop service to be turned on. The GN does not need the Hadoop service running. Add to grid.xml:
    chkconfig hadoop off
  3. None of the INs except the secondary namenode (interactive-0-0) need the Hadoop service running either. Add to interactive.xml:
    if [ `/bin/hostname -s` != interactive-0-0 ] ; then
         chkconfig hadoop off
    fi

Mount the Hadoop volume with FUSE:

FUSE must be installed on the node which will be the SRM server (for us, the GN). For CMSSW jobs, it is unclear if FUSE must run on the WNs (so that traditional file paths work). It may be possible to configure site-local-config.xml to access files using hadoop commands, but we don't know how, so we opt to install FUSE on our WNs as well. We also want to install FUSE on the Hadoop namenode and interactive nodes for admin/user ease of access.

FUSE doesn't like multiple versions of the JDK library. Remove all but x86_64, e.g.:

rpm -e --nodeps jdk.i586 jdk.x86_64
yum install jdk.x86_64

We also add the following lines to extend-compute.xml:

rpm -e --nodeps jdk.i586 jdk.x86_64 &gt;&gt; /root/hadoop-install.log 2&gt;&amp;1
yum install jdk.x86_64 &gt;&gt; /root/hadoop-install.log 2&gt;&amp;1

FUSE requires a Kernel package from ATrpms which must match the currently installed Kernel. Every time a Kernel update is installed, a new FUSE Kernel package must also be installed. In the case of Kickstarts, every time a new FUSE Kernel package is installed via yum, it must also be placed in the contrib directory (below) and the rpm call in extend-compute.xml must be modified to match the new version number. Then a new Rocks distribution must be made for the next time nodes are Kickstarted.

As root on the node where you wish to install FUSE manually and mount the volume:

  1. Get the Kernel version which is currently running on the node:
    uname -r
  2. Navigate to the ATrpms fuse repository and select the fuse-kmdl rpm with number that matches the currently running Kernel version. Download it. E.g.:
    wget "http://dl.atrpms.net/all/fuse-kmdl-2.6.18-238.19.1.el5-2.7.4-8_12.el5.x86_64.rpm"
  3. Install it:
    rpm -ivh fuse-kmdl-2.6.18-238.19.1.el5-2.7.4-8_12.el5.x86_64.rpm
  4. Navigate to the ATrpms fuse repository and select the most recent fuse rpm. Download it and install it, e.g.:
    wget "http://dl.atrpms.net/all/fuse-2.7.4-8_12.el5.x86_64.rpm"
    rpm -ivh fuse-2.7.4-8_12.el5.x86_64.rpm
  5. Edit /etc/fstab to add the Hadoop volume with whatever mount point name desired. We use /hadoop:
    hdfs# /hadoop fuse server=SE-0-1,port=YourHADOOP_NAMEPORT,rdbuffer=32768,allow_other 0 0
  6. Make the mount point directory:
    mkdir /hadoop
  7. Mount the volume:
    mount /hadoop
    This error message is normal and can be ignored:
    fuse-dfs didn't recognize /hadoop,-2
    fuse-dfs ignoring option allow_other

FUSE can be automatically installed and the volume can be mounted via Rocks Kickstart by doing the following as root on the HN:

  1. We call "yum update" in extend-compute.xml. Thus, the latest Kernel will be installed on a Kickstarted node. The latest Kernel version can be determined by examining the SL repo or by calling commands on any node in the cluster:
    1. Check to see if a new Kernel might be installed on this node, since Kickstart will always get the latest Kernel:
      yum check-update
    2. If this indicates a new Kernel will be installed, use this as your reference version. If it does not, get the most recent Kernel version which is currently installed on this node:
      rpm -qa | grep kernel
  2. Go to the Rocks contrib directory:
    cd /export/rocks/install/contrib/5.4/x86_64/RPMS
  3. Navigate to the ATrpms fuse repository and select the fuse-kmdl rpm with number that matches the Kernel version that will be installed during Kickstart. Download it. E.g.:
    wget "http://dl.atrpms.net/all/fuse-kmdl-2.6.18-238.19.1.el5-2.7.4-8_12.el5.x86_64.rpm"
  4. Edit extend-compute.xml to install this specific FUSE Kernel module. A simple "yum install fuse-kmdl" will not work - the package name for the example rpm above is "fuse-kmdl-2.6.18-238.19.1.el5", so extend-compute.xml must be modified for every new fuse-kmdl release. In the <post> section, add the lines (modify for different release numbers):
    wget http://&Kickstart_PublicHostname;/install/rocks-dist/x86_64/RedHat/RPMS/fuse-kmdl-2.6.18-238.19.1.el5-2.7.4-8_12.el5.x86_64.rpm
    rpm -ivh fuse-kmdl-2.6.18-238.19.1.el5-2.7.4-8_12.el5.x86_64.rpm
  5. Navigate to the ATrpms fuse repository and select the most recent fuse rpm. Download it. E.g.:
    wget "http://dl.atrpms.net/all/fuse-2.7.4-8_12.el5.x86_64.rpm"
  6. Edit extend-compute.xml to install FUSE by adding:
    <package>fuse</package>
  7. Edit extend-compute.xml to mount the Hadoop volume with FUSE. In the <post> section, add the lines:
    <file name="/etc/fstab" mode="append">
    hdfs# /hadoop fuse server=SE-0-1,port=YourHADOOP_NAMEPORT,rdbuffer=32768,allow_other 0 0
    </file>
    mkdir /hadoop
    mount /hadoop

Kickstart nodes:

Now that all the Rocks Kickstart files have been edited to install, configure, and start Hadoop, create the new Rocks distribution and Kickstart the SE (hadoop namenode), INs, and WNs. As root (su -) on the HN:

  1. Create the new Rocks distribution:
    cd /export/rocks/install
    rocks create distro
  2. Reinstall the nodes.