May 2008 log
May was devoted to getting familiar with Rocks basics and installing non-grid software.
May 30, 2008
MK--Reinstalled the cluster.
- No major hitches. Network mounted /software. /data waiting on XFS installation (need info from Mark, TODO). Restored security, except for one issue (TODO). Installed Pacman, apt-get, and CMSSW. cmssoft .cshrc file is generating an error:
E: Could not open lock file /var/state/apt/lists/lock - open (13 Permission denied)
E: Unable to lock the list directory
This prevents apt-get update from running at login time, though it can still be run manually from the command line. (TODO)
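- A possible workaround, sketched in sh (the cmssoft .cshrc would need the csh equivalent): only run the update when logged in as root, since only root can take the APT list lock named in the error above.
    # Skip the login-time update unless we are root.
    if [ "`id -u`" = "0" ]; then
        apt-get update
    fi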
May 29, 2008
MK--Attempted to network mount /data, discovered HN services destroyed by gLite UI.
- gLite removed the existing Rocks python, which nearly all Rocks services need. The 411 service was also somehow uninstalled. Attempted to reinstall python, without success (gLite may be forcing the python install to go elsewhere). 411 cannot be reinstalled without the Rocks python. With the HN services destroyed, we will reinstall the HN and the entire cluster (TODO). Will attempt the gLite UI install on a WN instead (TODO).
May 28, 2008
MK & JT--Completed gLite UI install.
- gLite UI installation complete. Had to install JPackage and SOAP. gLite either recognized the Log4J install we did yesterday or JPackage provided it (we suspect the latter, which is why the instructions for an explicit Log4J install have been removed from the Admin How-To guide). gLite UI is not yet configured, so we are unable to test functionality. (TODO)
May 27, 2008
MK & JT--Completed network mount of /software and installed CMSSW. Checked external network. Started CRAB w/gLite install.
- /software had not been properly mounted on the WNs because the auto.software file had not propagated to them. The 411 service had to be fully cleaned and rebuilt (instructions from the Rocks-Discuss listserv; rebuild sketch at the end of this entry).
- OIT folks fixed problem with connection to outside world on some of the patch panel ports. All WNs now have externally addressable IPs and respond to pings.
- gLite UI appears to want JPackage in addition to a number of other pieces of software. Installed SOAP::Lite and Log4J manually (perhaps unnecessarily; JPackage may provide these). Began the JPackage install.
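- For reference, a sketch of the 411 rebuild used for the auto.software propagation (per the Rocks-Discuss instructions; /var/411 is the usual 411 directory and /etc/auto.software is where we expect the map to land, so verify both for this Rocks release):
    # On the HN: clean and rebuild everything 411 distributes.
    cd /var/411
    make clean
    make
    # On a WN: pull all 411 files and confirm the map arrived.
    411get --all
    ls -l /etc/auto.software
    service autofs reload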
May 23, 2008
MK--Attempted to resolve WN network problems.
- None of the nodes connected to the second patch panel can reach the outside world (diagnostic sketch at the end of this entry).
- Two nodes do not have any connection whatsoever (no lights on interface).
- Verified with multiple cables and multiple machines. Emailed OIT.
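- Per-node checks to rerun once OIT responds (eth1 is the patch-panel interface per the May 21 cabling note; the ping target is just an example external host):
    ethtool eth1 | grep "Link detected"   # "yes" means the port has link
    ifconfig eth1                         # confirm the external address is up
    ping -c 3 www.umd.edu                 # any externally addressable host will do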
May 22, 2008
MB & MK--Reformatted /data as LVM2. Configured WNs for external network, some failed.
- /data now uses a GUID partition table. XFS does not normally set up the partition table correctly, so programs that query the space available on /data may not have worked. Using this new layout should prevent that problem (setup sketch at the end of this entry).
- compute-0-0, 0-1, and 0-2 connected externally; all others appeared to configure successfully but cannot be pinged.
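- A minimal sketch of the reformat, assuming the big disk is /dev/sdc (as listed in the May 1 entry) and that the whole disk goes into a single LVM2 volume group; the exact invocations were not recorded, and the volume group and logical volume names here are placeholders:
    parted /dev/sdc mklabel gpt              # GUID partition table
    parted /dev/sdc mkpart primary 0% 100%   # one partition spanning the disk
    pvcreate /dev/sdc1
    vgcreate datavg /dev/sdc1
    lvcreate -l 100%FREE -n data datavg
    mkfs.xfs /dev/datavg/data
    mount /dev/datavg/data /data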
May 21, 2008
MK--Networked eth1 on all WNs to connect to patch panel, attempted NFS mount for CMSSW.
- Purchased cables. Blue=eth0=switch, gray=eth1=patch panel. eth1 not yet enabled on any WNs (TODO).
- Attempted an NFS mount of /scratch/cmssw and /scratch/other on the HN as /software/cmssw and /software/other on all nodes. CMSSW doesn't like /share, so we want to install CMSSW in /software/cmssw. No error messages, but also no network mount (TOFIX; export/mount checks sketched at the end of this entry).
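- Export/mount checks to run when picking this back up (paths from this entry; replace frontend-0-0 with the HN's actual private hostname):
    # On the HN:
    cat /etc/exports         # /scratch/cmssw and /scratch/other should be listed
    exportfs -ra             # re-export after any edit to /etc/exports
    showmount -e localhost   # confirm the exports are visible
    # On a WN:
    mkdir -p /software/cmssw
    mount -t nfs frontend-0-0:/scratch/cmssw /software/cmssw
    df -h /software/cmssw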
May 20, 2008
MB, MK & MT--Added security, continued CMSSW installation attempts
- Security info detailed in ~root/security.txt.
- Banged head uselessly on CMSSW. The problem is isolated to a soft link named share, created in /share/apps/cmssw and pointing to /share. CMSSW also needs a true directory named /share/apps/cmssw/share, in which we believe scram installs. The soft link overwrites the needed true directory, and scram then attempts to install to a directory it doesn't have permission to access. Problem forwarded to the framework folks, though a solution from them is unlikely. Will move the network mounts, or create a new one specifically for CMSSW, at a later time. Abandoning for now in favor of prep work for Paul & Tony's arrival.
May 16, 2008
MB & MK--Added security, user modification
- Continued attacks indicated we needed to take further proactive steps to protect the cluster. Details of the configuration are not elaborated here for security purposes; they can be read by root in ~root/security.txt.
- Disabled ssh access by root and removed some temporary users.
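- The usual way to disable root ssh logins is a single line in the standard sshd configuration (the details of our actual configuration stay in ~root/security.txt as noted above):
    grep PermitRootLogin /etc/ssh/sshd_config   # should read: PermitRootLogin no
    service sshd restart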
May 15, 2008
MK & MT--Configured Kerberos, tried to reinstall CMSSW, added area51 roll
- Examining /var/log/auth indicated we have been receiving concentrated ssh attacks from a small set of people/bots. Our latest queries with OIT indicated that we would not be able to put our cluster behind the UMD firewall, so we installed the area51 (security) roll, which may help with the problem.
- Kerberos has been configured for both CERN & FNAL and it is possible to get tickets. However, the tickets don't appear to be used on ssh or scp attempts. This is expected for CERN, but not for FNAL (TODO; test sketch at the end of this entry). Also need to test CERN tickets once CVS is configured (TODO).
- CMSSW install at /share/apps/cmssw instead of /export/apps/cmssw ran into some hiccups:
- RPM checks for disk space on /share, but the network-mounted drive doesn't report the actual disk space available (solved).
- CMSSW installation creates a soft link in /share/apps/cmssw named share, which links to /share. CMSSW then tries to mkdir /share/apps/cmssw/share/scramdb, and naturally fails (logged in as cmssoft user). What is confusing is why it worked when we installed in /export/apps/cmssw the first time around (TODO).
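- A quick test for the FNAL ticket issue noted above (username and host are placeholders; GSSAPI has to be enabled on both the ssh client and the server for the ticket to be used):
    kinit username@FNAL.GOV
    klist   # confirm a valid ticket-granting ticket
    # -v shows which authentication method ssh actually ends up using.
    ssh -v -o GSSAPIAuthentication=yes -o GSSAPIDelegateCredentials=yes username@somehost.fnal.gov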
May 14, 2008
MK & MT--Installed CMSSW
- Tests on HN were largely successful.
- Outstanding issues: CMSSW unable to contact Frontier, WNs don't have scramv1 (possibly others).
- Also need to set up for CVS.
May 13, 2008
MK--Installed Pacman & requested our two certificates (host & http).
- Performed temporary install of VDT PPDG-Cert-Scripts package for the purpose of requesting certificates. Once requests complete, I will remove the temporary VDT install directory in favor of the coming complete OSG installation.
May 12, 2008
MB, MK, & MT--Formatted bigdisk using XFS, began CMSSW installation.
- Downloaded and compiled XFS and integrated it with the kernel.
- Formatted the big disk using XFS and mounted it as /data on the HN (sketch at the end of this entry). It still needs to be mounted on the WNs.
- Began the first half of the CMSSW installation, i.e. the steps common to all CMSSW installations. A particular CMSSW release still needs to be installed.
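- Sketch of the format-and-mount steps for the big disk (assuming it is /dev/sdc as listed in the May 1 entry and that a single partition /dev/sdc1 is used; adjust if the later LVM setup changes this):
    modprobe xfs
    grep xfs /proc/filesystems   # confirm the newly built XFS support is loaded
    mkfs.xfs /dev/sdc1
    mkdir -p /data
    mount -t xfs /dev/sdc1 /data
    # To survive reboots, an /etc/fstab line along these lines is needed:
    #   /dev/sdc1  /data  xfs  defaults  0 0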
May 7, 2008
MB & MK--Finished WN Rocks installation, successfully tested user creation, and Condor job submission.
- Forced the default partitioning scheme on the WNs to eliminate the kernel panics and boot issues, then went back and changed the WN partitions post-install. Setting custom WN partitions before the Rocks WN install failed in nearly every configuration we tried.
- Condor jobs submit with no issues whatsoever. Unknown what was causing the problem before.
- Added users, including cmssoft.
May 6, 2008
MB & MK--Created users, tested Condor job submission and Ganglia monitoring, and re-installed Rocks
- Created users by creating their home area in /home/export, then issuing the rocks-sync-config and rocks-sync-users command (See Admin How-to: Install Rocks).
- Condor jobs are submitting correctly and are able to access the /home/username area, but they claim they cannot access the /sbin/sleep executable. The executable is present and can be run interactively after manually logging into the WN. Issue unresolved (a minimal test-job sketch follows this entry).
- Opened https and www ports to allow remote access to ganglia monitoring (see Admin How-to: Install Rocks).
- Re-installed Rocks because of typo in ClusterName and because we came to the conclusion that the grid roll would cause later problems. WNs appear to be having issues. compute-0-0 is the only one to install successfully. Will investigate tomorrow.
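- A minimal Condor test job for the sleep problem, as a sketch; keep executable = /sbin/sleep to reproduce the failure, or point it at another known-good binary to narrow things down. Contents of a sleep-test.sub submit file:
    universe   = vanilla
    executable = /sbin/sleep
    arguments  = 60
    output     = sleep-test.out
    error      = sleep-test.err
    log        = sleep-test.log
    queue
  Submit with condor_submit sleep-test.sub and watch it with condor_q.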
May 2, 2008
MB & MK--Completed Rocks install on WNs.
- Set the switch to get its IP from the HN using DHCP. Disabled spanning tree, which had been preventing the Rocks WNs from contacting the HN correctly.
- We received the error "className=FDiskWindow" during the WN install. This was due to an incorrect replace-auto-partition.xml file (we used the device names hda and hdb instead of sda and sdb). All WNs now have the correct partitions (sizes in MB; a reconstruction of the corrected file follows this entry):
  /dev/sda (76293):
    /                  8192    /sda1  ext3
    swap               8192    /sda2  ext3
    /var               4096    /sda3  ext3
    /state/partition1  55813   /sda4  ext3
  /dev/sdb (238418):
    /tmp               238418  /sdb1  ext3
- It is undecided at this time what the role of /state/partition1 will be. /tmp is meant for temporary CMSSW job output prior to transfer to the big disk, if users choose not to use their /home area for job output. We believe both /state/partition1 and /tmp will be preserved over Rocks/OS upgrades.
- Edited the node LEDs to indicate the names of each node as given by Rocks.
- To do: Rename cluster in Rocks, if possible (typo in current name). Create the HN /sdc logical volume. Make sure WNs can send output from /tmp to the HN /sdc volume without having to employ srmcp. Test condor job submission from HN to WNs. Test network to outside world and ssh from outside world into HN. Configure Ganglia. Get site grid certificate and configure Globus Toolkit. Install Pacman. Install remaining software.
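- A reconstruction of the corrected replace-auto-partition.xml (sizes in MB, matching the table above; the exact XML schema and the site-profiles path it lives under depend on the Rocks version, so check the Rocks Users Guide before reusing this):
    <?xml version="1.0" standalone="no"?>
    <kickstart>
      <main>
        <part> / --size 8192 --ondisk sda </part>
        <part> swap --size 8192 --ondisk sda </part>
        <part> /var --size 4096 --ondisk sda </part>
        <part> /state/partition1 --size 1 --grow --ondisk sda </part>
        <part> /tmp --size 1 --grow --ondisk sdb </part>
      </main>
    </kickstart>
  After editing this file, the install distribution has to be rebuilt before reinstalling the WNs (the command for this differs by Rocks version).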
May 1, 2008
MB, MK, & MT--Connected to the switch via the serial connection and installed Rocks on the HN.
- Established a serial connection to our switch with a VT100 emulator using XXXXX (See Admin How-To: connect to the switch).
- Installed Rocks on the HN (See Admin How-To: install Rocks); partition sizes in MB:
  /dev/sda (69374; RAID-1, 67.75 GB):
    /         8189    /sda1  ext3
    swap      8189    /sda2  swap
    /var      4095    /sda3  ext3
    (/sda4 is the extended partition containing /sda5)
    /scratch  48901   /sda5  ext3
  /dev/sdb (418168; RAID-5, 408.38 GB):
    /export   418168  /sdb1  ext3
  /dev/sdc (XXXXXX; RAID-6):
    Left alone at this time; Rocks cannot handle logical volumes at install time.
- Attempted to install Rocks on the WNs, but the WNs' requests for IP information are not making it to the HN (we suspect a switch issue).
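- A way to narrow this down from the HN side (assuming eth0 is the HN's private interface, the Rocks default): watch for the WNs' DHCP/PXE requests while a WN tries to install.
    tcpdump -i eth0 -n port 67 or port 68
  If nothing shows up while a WN is PXE booting, the requests are dying at the switch rather than on the HN.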