April-September 2011 Log
Outage August 5-7 including software and firmware upgrades. Make more user groups. Debug R510 nodes not coming online after power outage. CRAB upgrade. Debug SE-0-1 memory problems. Install Hadoop. Install GUMS. Outage September 21-22 to install two new R510 nodes, and replace PDU/strip hardware.
September 21-22, 2011
MT, JW, ALM, & GF -- Re-arranged hardware, replaced middle PDUs, replaced 2 of 4 power strips, installed new R510 nodes
- Removed old network switch.
- Removed unused second ethernet cables on the R510s (if we do port bonding in the future, they will need to be put back).
- Replaced the middle-of-rack PDUs with new ones. Replaced the top two (left and right) power strips connected to those PDUs. The top-right power strip showed smoke damage along one cord (same strip that had problems in April and July). Note that it was not possible to remove the bottom two power strips (bottom left 0U and middle right 0U) because there is no space available. Decided not to replace them at this stage.
- Moved KVM switches down
- Noted that the display cord on the KVM showed wear; covered it with electrical tape. Will need to consider replacing it, or wrapping something protective around the cord, in the future.
MT, JW -- Powered on, kickstart new R510 nodes, debug grid services
- Powered on. Note that the HN and GN plugs are "touchy" in the bottom-left PDU, which is recessed into the rack (difficult to fit your arm in there). Cannot seem to get them seated any better; best not to touch them unless necessary.
- Kickstarted the two new R510 nodes. Had to kickstart them with the xml developed to format partitions (note that the compute nodes and R510s have different .xml files for that). These files are backed up on the HN in /root. Kickstarted the nodes a second time with the default R510 xml file (which includes --noformat) so that they are in the same Hadoop state as the other R510s.
- Grid services needed GUMS to come up on the HN before OSG services started on the GN. Changed the recovery procedure page to reflect that. The problems Malina had previously with the post-GUMS install seem to have been fixed by the full restart of the cluster & services. SAM tests are finishing at reasonable speed again.
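- For reference, a rough sketch of the start order that worked (this assumes GUMS runs under Tomcat on the HN and that OSG services on the GN are controlled by vdt-control; the tomcat5 service name is an assumption, not taken from our install notes):
# on the HN: bring up GUMS first
service tomcat5 start
# then on the GN: start the OSG services
. /sharesoft/osg/ce/setup.sh
vdt-control --on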
September 2011
MK -- Installed and Configured GUMS
- Documented here.
September 9, 2011
MK -- Attempted to install gridftp-hdfs
- Had tried the gridftp-hdfs installation previously (prior to Aug 26), but had run into unknown problems. Note that in the previous install, I was able to get the BeStMan that came with OSG pacman to contact the gridftp-hdfs server (it just talks to whatever service is on that port), so we don't have to install BeStMan via rpm.
- Regardless, couldn't get the gridftp-hdfs service itself to work; tested with globus-url-copy file:///`pwd`/testfile.txt gridftp://hepcms-0.umd.edu:2811/hadoop/....
- Tried again today. Ran into a problem with the /etc/grid-security/certificates directory along the way; was able to repair it.
- The following got gridftp-hdfs installed without any apparent conflicts with existing services:
- Turn off current OSG CE & SE, protect the certificates dir by removing the symlink (will add it back later):
. /sharesoft/osg/ce/setup.sh
vdt-control --off
unlink /etc/grid-security/certificates
- In a fresh shell, install gridftp-hdfs, remove the packages we know will conflict with the OSG CE & SE, and fix the certificates directory:
yum install gridftp-hdfs
rpm -e --nodeps osg-ca-certs fetch-crl
cd /etc/grid-security
rmdir certificates
ln -s /sharesoft/osg/ce/globus/share/certificates
- Added the following lines to /etc/gridftp-hdfs/gridftp-hdfs-local.conf:
export GRIDFTP_HDFS_MOUNT_POINT=/hadoop
export TMPDIR=/tmp
export GRIDMAP=/etc/grid-security/grid-mapfile
- Restarted gridftp-hdfs to pick up the new configuration:
service xinetd restart
- But the service wouldn't work. globus-url-copy always gave the error:
530 530-Login incorrect. : globus_gss_assist: Error invoking callout
530-globus_callout_module: The callout returned an error
- This looks to me like it can't contact the prima callout for GUMS. Despite my attempts to get gridftp-hdfs talking to the grid-mapfile, I couldn't figure out where the settings for that live. Postponed further work on gridftp-hdfs until we can install GUMS.
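- For future reference, a guess at where that mapping is usually configured (not verified on this system): the GSI authorization callout normally lives in /etc/grid-security/gsi-authz.conf, and with no callout configured the server should fall back to the GRIDMAP file. A minimal check, with the file location assumed rather than confirmed:
cat /etc/grid-security/gsi-authz.conf
# if a globus_mapping line here points at the prima/GUMS callout, commenting it out
# should let gridftp-hdfs fall back to /etc/grid-security/grid-mapfile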
September 7, 2011
MK -- Fixed R510-0-5 /hadoop6
- After cluster reboot, R510-0-5 refused to come up, claiming that /hadoop6 was having problems. Reformatted the partition by removing it and creating it again:
fdisk /dev/sdf
(various command options)
mkfs.ext3 /dev/sdf1
e2label /dev/sdf1 /hadoop6
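- A minimal sketch of the follow-up checks (the fstab line below is an assumption about how /hadoop6 is mounted on this node, not copied from it):
# assuming /etc/fstab mounts the partition by label, e.g.:
#   LABEL=/hadoop6   /hadoop6   ext3   defaults   1 2
mount /hadoop6
df -h /hadoop6    # confirm the filesystem mounted and shows the expected size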
August 26, 2011
MK -- Upgraded Hadoop to Hadoop 0.20
- Modified the Admin guide to match all the new details. Specifically, the package name has changed to hadoop-0.20-osg, there is no need to specifically install hadoop-fuse, the new FUSE release doesn't like multiple JDK releases, and the hadoop environment file in which the log directory gets set has moved.
- Had some issues with upgrading the old hadoop volume to the new hadoop release, but ended up stumbling across a solution similar to that already documented in the upgrade guide.
- Also had some problems with the hadoop rpm creating a local hadoop user instead of using the one that already existed. Ran a "rocks sync users" from the HN. Ended up with Hadoop services and directories owned by a user that no longer exists. Had to manually log into all nodes involved with hadoop to kill processes and fix directory ownership and permissions (roughly the sequence sketched below). R510-0-5 /hadoop6 ended up with serious permissions problems that appeared to be fixed, but generated errors later.
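- Per-node cleanup, roughly (the node name, stale UID, and directory list below are illustrative, not a record of the exact commands run):
ssh r510-0-5 'pkill -9 -u 507'                              # kill anything still owned by the stale UID
ssh r510-0-5 'chown -R hadoop:hadoop /hadoop1 /hadoop2'     # hand the data directories back to the real hadoop user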
August 19-20, 2011
MK -- Kickstarted R510-0-2, R510-0-6
August 18, 2011
MK -- Installed FUSE via Kickstart, Kickstarted most nodes, copied /store data over to Hadoop volume, checked /hadoop permissions
- Decided to install FUSE via Kickstart because the WNs are going to need it to access the new /hadoop volume.
- Most nodes have now been Kickstarted, with the exception of R510-0-2 and R510-0-6, which are still running jobs.
- Copied the files in /store over to /hadoop and gave them the correct user permissions. Did not copy /store/user/mkirn over, as it won't be needed soon. Also need to double-check that /store/user/eberry is complete once we officially switch over, as this directory may be under active modification.
- Was able to preserve directory permissions, so /store/subdir/* is not readable by accounts outside the owning user's group.
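- Illustrative form of the copy (the specific directory here is just an example; the full per-user list was not recorded in this entry):
cp -pr /store/user/eberry /hadoop/store/user/    # -p preserves ownership, permissions, and timestamps
ls -ld /hadoop/store/user/*                      # spot-check that group/other permissions survived the copy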
August 16, 2011
MK -- Hadoop via Kickstart
- Configured all Kickstart files to install, configure, and run Hadoop services (when appropriate). Created new distro. Plan to Kickstart all WNs (remembering to call rocks remove host partition and the nukeit script), but currently our Condor pool is completely full.
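- Per-node sequence for this, assembled from the commands used elsewhere in this log (compute-0-2 is just an example node):
rocks remove host partition compute-0-2
ssh compute-0-2 'sh /share/apps/sbin/nukeit.sh'
ssh compute-0-2 '/boot/kickstart/cluster-kickstart'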
August 12, 2011
MK -- Hadoop
- Running hadoop namenode on SE-0-1.
- Having trouble with secondary namenode. Originally tried running on HN, but hadoop logs showed:
- On SE-0-1:
2011-08-12 10:31:13,826 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(4298)) - Roll Edit Log from 10.1.1.1
- And on HN:
2011-08-12 10:31:14,059 ERROR namenode.SecondaryNameNode (SecondaryNameNode.java:run(229)) - Exception in doCheckpoint:
2011-08-12 10:31:14,061 ERROR namenode.SecondaryNameNode (SecondaryNameNode.java:run(230)) - java.io.FileNotFoundException: http://SE-0-1:50070/getimage?putimage=1&port=50090&machine=128.8.164.11&token=-lotsofnumbers
- So the HN is contacting the SE on the internal IP 10.1.1.1, but it's sending requests for the hadoop image claiming that its IP is 128.8.164.11, which is the HN external IP. I don't know how to tell Hadoop that the machine IP is the internal address.
- Was never able to fix this. Since the "hostname -s" output matches both the public and private hostnames, Hadoop seems to be confused about which interface to run on. This seems to be true even when I explicitly tell it the secondary namenode is hepcms-hn.local and then modify the startup script to append .local to the output of "hostname -s". So I gave up on putting the secondary namenode on the HN.
- Tried running the secondary namenode on one of the WNs instead. I didn't want to lose the disk space, so figured out how to run both a secondary namenode and a datanode on the same host. Of course, if that node goes down for any reason, we lose both a datanode and the secondary namenode; it's not recommended, but not out of the question. To get this working, I had to edit the /etc/init.d/hadoop script so that it didn't exclusively start one of (1) primary namenode, (2) secondary namenode, or (3) datanode. It was relatively simple to do: just turn the bash "elif" branches into independent "if" blocks (see the sketch after this entry). I also wanted it to be elegant, so adjusted various logic for returning exit codes, etc., though that was probably not strictly necessary.
- Ultimately decided to run the secondary namenode on an interactive node. The secondary namenode needs enough RAM to hold the entire Hadoop namespace (the filesystem metadata) in memory. Sometimes the RAM on the WNs can be almost completely consumed, whereas Ganglia indicates consumed RAM has never exceeded 1 GB on the INs within the last year.
- To get fuse to mount the hadoop volume, I had to install the fuse kernel module from ATrpms. Since the installed version has to match the running kernel, every time a yum update pulls in a new kernel we'll have to manually download and install the matching fuse rpms. That will be a pain, so I've decided to install and run fuse only on the SE, GN, and the two INs.
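- Sketch of the /etc/init.d/hadoop change mentioned above (the variable names below are illustrative stand-ins, not the ones the hadoop-0.20-osg init script actually uses):
# before: an if/elif chain meant only one daemon could ever start per host
# after: independent if blocks, so a host marked as both secondary namenode and datanode starts both
if [ "$HADOOP_NAMENODE" = "true" ]; then
    hadoop-daemon.sh start namenode
fi
if [ "$HADOOP_SECONDARYNAMENODE" = "true" ]; then
    hadoop-daemon.sh start secondarynamenode
fi
if [ "$HADOOP_DATANODE" = "true" ]; then
    hadoop-daemon.sh start datanode
fi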
August 11, 2011
MK -- WN Kickstart partitions
- During the August 3 updates, tested the code in the new extend-compute.xml file to make sure the modifications were functional by Kickstarting compute-0-2. The modifications were fine (after dealing with the following issues). Discovered that the compute nodes weren't coming up. shoot-node appeared to fail altogether, suspected due to BMC DHCP requests getting intercepted by the HN and being interpreted as Kickstart requests. Additionally, subsequent tests using PXE boot + insert-ethers showed that the compute nodes were not OK with the "preserve partitions" scheme currently implemented in replace-partition.xml. Eventually brought up compute-0-2 as compute-0-14 using the "force-default" scheme and left the node for subsequent testing.
- Had seen errors in the past about not enough room for the "/" partition and had solved them using --bytes-per-inode=4096 --fstype="ext3", so used these in replace-partition.xml and Kickstarted compute-0-14 again with the "format for Hadoop" partitioning scheme. The node still came up with the same error. I reviewed the 'fix' we had earlier for the R510s and realized it probably wasn't the --bytes-per-inode and --fstype settings that fixed the problem. It's the fact that my previous --ondisk options were written as "--ondisk sda" but needed to be written as "--ondisk=sda". The Rocks partition guide indicates that spaces are OK, but apparently this is not the case.
- Brought up compute-0-2 (now compute-0-14) using the "=" syntax and saved this file as /root/replace-partition.xml.format.
- Long story short, if an option in the Kickstart partition commands takes an argument, always use an "=", don't use a space (see the example after this entry)!
- Installed the kickstart configurator on the HN to see if it recommended any other fixes. This is how I noticed the equals sign was missing.
- I did use the equals signs in replace-partition.xml the very first time I Kickstarted compute-0-2. However, either the problems with it not getting its Kickstart file in the first place, or my not calling "rocks remove host partition compute-0-2" and the nukeit script, introduced the "/" not-enough-size error. I used the original replace-partition.xml file, which preserves the Hadoop partitions on the compute nodes, to Kickstart compute-0-2 (now compute-0-14) again. But this time, I called "rocks remove host partition compute-0-14" and the nukeit script (ssh compute-0-14 'sh /share/apps/sbin/nukeit.sh'), then did a manual Kickstart instead of a shoot-node (ssh compute-0-14 '/boot/kickstart/cluster-kickstart'). The node came up successfully. I was able to watch it come up using "rocks-console compute-0-14", though obviously this command didn't work until compute-0-14 received its Kickstart file.
- Long story short, if replace-partition.xml changes, you must call "rocks remove host partition", even if replace-partition.xml is just preserving existing partitions. You don't need to call "rocks remove host partition" for every Kickstart if replace-partition.xml did not change.
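- Example of the working syntax in replace-partition.xml (the sizes and mount points here are illustrative, not copied from our file):
part / --size=16384 --ondisk=sda --fstype="ext3" --bytes-per-inode=4096
part /hadoop1 --size=1 --grow --ondisk=sdb --fstype="ext3" --bytes-per-inode=4096
# note the "=" after --ondisk, --size, etc.; writing "--ondisk sda" with a space breaks the Kickstart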
August 9, 2011
MK -- Running BeStMan as different user
- The daemon user isn't in the "users" group, so it couldn't read files in /data/se/store. To avoid making files on the SE world-readable, I ran BeStMan as a 'normal' user in the users group.
- Documented here.
August 5-7, 2011
MT -- Upgrades and cluster power outage
- Backed up important files on the HN and GN and ran yum software updates on the nodes. Updated firmware on the HN, GN, IN, and compute nodes.
- R510s and SE failed firmware update with message: Could not parse output, bad xml for package: dell_dup_componentid_00159.
- Tried to update OSG, but had trouble with the configuration and was not able to get it to work. Rolled back to the previously backed-up OSG (cp -pr). Note that the currently installed OSG tends to disable gratia-gridftp-transfer; best to do vdt-control --off, then vdt-control --list to see whether it's enabled, and vdt-control --enable gratia-gridftp-transfer to enable it if needed (commands sketched after this entry).
- Brought nodes up following recovery guide, verified systems working correctly.
- Note R510-0-1 reporting: snmpd[7272]: Got a trap from peer on fd 10
- Note R510-0-5 reporting: snmpd[7274]: Got a trap from peer on fd 10
- Note interactive-0-0 and interactive-0-1 reporting: snmpd[4546]: Got a trap from peer on fd 10; snmpr[4547]: looks like a 64bit wrap, but prev !=new
- Note SE-0-1 reporting: snmpd[7055]: accepted smux peer: oid SNMPv2-SMI::enterprises.674.10892.1, descr: Systems Management SNMP MIB Plug-In Manager; snmpd[7055]: Got trap from peer on fd 10
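- The gratia-gridftp-transfer check described above, roughly (run from the OSG CE environment, i.e. after sourcing /sharesoft/osg/ce/setup.sh):
. /sharesoft/osg/ce/setup.sh
vdt-control --off
vdt-control --list | grep gratia-gridftp-transfer    # see whether it shows as enabled
vdt-control --enable gratia-gridftp-transfer         # only if it shows as disabled
vdt-control --on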
August 3, 2011
MK -- Change groups to accommodate different user types, modified umask, file permissions, modified vdt-local-setup.(c)sh
- Modified the umask settings in /etc/bashrc and /etc/csh.cshrc such that normal users have a default umask of 027 instead of 022 as before; kept the root umask at 022 (sketched after this entry). Manually edited these two files on *all* nodes.
- Edited extend-compute.xml to include umask changes to these two files during Kickstart.
- Documented umask settings here, as well as uploading new extend-compute.xml and modifying admin guide for Kickstarting the WNs with info on the umask settings.
- Applied 750 permissions to most directories in /data and 711 permissions to most directories in /home. Modified /home/*/.globus/*, /home/*/public_html, and /data/groups/* permissions correctly.
- Because the permissions of /data/se/store changed, added some info to the admin and user guides in various places about grid accounts and about giving grid accounts ownership of folders in /data/se/store/users: [1, 2, 3].
- Also added the globus ports to the wnclient version of vdt-local-setup.(c)sh, which possibly/probably repaired the RSV default-status probe failure noted below.
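- Sketch of the umask and permission changes (the umask test below is the usual RHEL-style pattern for /etc/bashrc, not a copy of our exact edit, and the csh.cshrc change is analogous in csh syntax; paths in the chmod lines are examples):
if [ "$UID" -gt 99 ]; then umask 027; else umask 022; fi
chmod 750 /data/users/someuser    # most directories in /data
chmod 711 /home/someuser          # most directories in /home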
July 29, 2011
MT -- Edit vomses file
- Removed the cms voms.cern.ch entry from the vomses file.
July 14, 2011
MT -- Switch CRAB link to 2_7_8_patch1
- Testing was successful; switched the main alias to the new CRAB.
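- For the record, the alias switch is just a soft-link flip, along these lines (the path below is illustrative, not our actual CRAB install location):
ln -sfn CRAB_2_7_8_patch1 /sharesoft/crab/current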
July 13, 2011
MT -- Debugging failing RSV probe
- Switched the globus port ranges to 20000,25000 in vdt-local-setup.*sh. Turned vdt services off/on. Something went awry in configure: bestman ended up disabled and gratia-gridftp-transfer enabled (different settings than in my log from Feb 2011). Changed to bestman enabled and gratia-gridftp-transfer disabled, restarted vdt services. Did not fix the failing RSV probe.
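- The port-range change amounts to lines like the following in vdt-local-setup.sh (the variable names are assumed to be the standard Globus ones; the csh variant uses setenv):
export GLOBUS_TCP_PORT_RANGE=20000,25000
export GLOBUS_TCP_SOURCE_RANGE=20000,25000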
July 11, 2011
MT & MK -- Reboot GN
- Having RSV probe org.osg.batch.jobmanager-default-status fail (stdout error). Debugged, investigated with GOC trouble ticket, decided to reboot GN. Short downtime, services came back up. RSV probe default-status is still failing but doesn't seem to affect successful running of grid jobs.
July 6, 2011
MT -- Debug down R510 nodes & memory problem SE-0-1
- SE-0-1 has been offline for some time. It reports the following error on the front panel: "E2110 Multibit error DIMM A3, reseat the DIMM". Reseated the DIMM, and it seems to work. Went the extra step of unplugging and replugging everything; will need to keep that in mind when installing/debugging Hadoop in case something isn't seated properly, although I believe everything is correct.
- Note that SE-0-1 gives the following message on reboot: "Get trap from peer on fd 10".
- All nodes except R510-0-2 went down 3 July at 18:28, and the remaining R510s did not come back up: they did not power on and showed an amber system-status LED. Note that R510-0-2 and R510-0-5 are on the same upper-right-half power strip (which went down April 25). They did not come on after reseating the plugs. Reset the power strips (top right and top left) feeding the downed nodes with the white button (note that R510-0-2 was hosting grid jobs and continued without problem during the power strip reset). Found I had to remove power completely from the downed nodes and plug them back in, after which they rebooted; the cluster now seems to be happily hosting many grid jobs on all compute nodes (R510s included). Note that the power strip for the bottom R510 was plugged in elsewhere, so that strip was not reset, but the node came back up after an unplug/replug. Need to consider replacing the power strips.
June 29, 2011
MT -- Updated CRAB
- Locally updated CRAB to 2_7_8_patch1 and tested it. Did not change the soft link for the user version yet as there may be problems; will test more extensively over the next week.
April 25, 2011
MT -- Debug down R510 nodes
- The system alertlog shows all nodes rebooted 24 April at 16:50 (thunderstorm). R510 nodes 2 and 5 had orange lights on front and back indicating they were not getting sufficient power. Checked that the plugs were seated properly. Debugging showed that half of the top-right power strip was getting no power (just the top half of the strip, not the complete strip). Reset it with the white button, plugged the nodes back in, and they work.