April 2008 log
April was devoted to getting the cluster operational. We performed basic tasks such as OS installation, partitioning, and networking.
April 30, 2008
MB, MK, & MT -- Reinstalled OS via Rocks. Rocks installed Ganglia, Condor, the Globus Toolkit, Scientific Linux 4.5, and other base utilities.
- Changed HN partitions for the Rocks installation (sizes are in MB):
/dev/sda 69374, RAID-1, 67.75 GB, physical disks 0:0:0, 0:0:1:
/ (root) 8189 /sda1 ext3
swap 8189 /sda2 swap
/var 4095 /sda3 ext3
/sda4 is the extended partition which includes /sda5
/scratch 48901 /sda5 ext3
/scratch is meant as a shared user storage area.
/dev/sdb 418168, RAID-5, 408.38 GB, physical disks 0:0:2, 0:0:3, 1:0:4, 1:0:5:
/export 418168 /sdb1 ext3
/export will contain the users' home areas as well as most of the software. It will be automatically mounted on all the worker nodes, which access it via /share. Applications will be installed in /export/apps on the HN (accessible via /share/apps on the WNs).
/dev/sdc
Left alone at this time. We would like to set up a logical volume later if possible, though Rocks claims it doesn't support LVM.
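The /dev/sda and /dev/sdb figures above can be cross-checked with a few lines of Python. This is only a sketch: it assumes the bare numbers in the log are sizes in MB and that the GB figures use 1 GB = 1024 MB, neither of which the log states explicitly. The names sda_partitions, allocated_mb, and mb_to_gb are made up for this check.

```python
# Sizes as recorded in the log; assumed to be in MB (the log gives no unit).
sda_partitions = {
    "/": 8189,
    "swap": 8189,
    "/var": 4095,
    "/scratch": 48901,
}
sda_total_mb = 69374

sdb_partitions = {"/export": 418168}
sdb_total_mb = 418168


def allocated_mb(partitions):
    """Total space claimed by the listed partitions, in MB."""
    return sum(partitions.values())


def mb_to_gb(mb):
    """Convert MB to GB, assuming 1 GB = 1024 MB."""
    return mb / 1024


for name, parts, total in (
    ("/dev/sda", sda_partitions, sda_total_mb),
    ("/dev/sdb", sdb_partitions, sdb_total_mb),
):
    unallocated = total - allocated_mb(parts)
    print(name, "unallocated MB:", unallocated, "size in GB:", round(mb_to_gb(total), 2))
```

Under these assumptions the four /dev/sda partitions account for the whole 69374 MB, and the converted sizes agree with the logged 67.75 GB and 408.38 GB to within about 0.02 GB.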
- Network configuration (HN is connected to switch and patch-panel, WNs are connected to switch):
HN external IP: 128.8.164.12 (eth1)
HN internal IP: 10.1.1.1 (eth0)
Gateway: 128.8.164.1
Netmask for external network: 255.255.255.192
Netmask for internal network (on HN): 255.0.0.0
DNS: 128.8.74.2, 128.8.76.2
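As a sanity check on the addressing above, the following sketch uses Python's standard ipaddress module to derive the subnets implied by the two netmasks and to confirm that the gateway is reachable from the external interface. The variable names are ours; the addresses and netmasks are copied from the log.

```python
import ipaddress

# Address/netmask pairs taken from the log entries for eth1 and eth0.
external = ipaddress.ip_interface("128.8.164.12/255.255.255.192")
internal = ipaddress.ip_interface("10.1.1.1/255.0.0.0")
gateway = ipaddress.ip_address("128.8.164.1")

# The network each interface belongs to, in CIDR notation.
print("external network:", external.network)
print("internal network:", internal.network)

# The gateway must sit in the same subnet as the external interface.
print("gateway on external subnet:", gateway in external.network)

# The internal address should fall in private (RFC 1918) space.
print("internal address is private:", internal.ip.is_private)
```

The external netmask corresponds to a /26 (128.8.164.0/26), which contains both the head node address and the gateway; the internal netmask corresponds to 10.0.0.0/8.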
- Error encountered in install:
Traceback (most recent call last):
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/gui.py", line 1074, in handleRenderCallback
self.currentWindow.renderCallback()
File "/tmp/updates/usr/lib/anaconda/progress_gui.py", line 249, in renderCallback
self.intf.icw.nextClicked()
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/gui.py", line 789, in nextClicked
self.dispatch.gotoNext()
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/dispatch.py", line 171, in gotoNext
self.moveStep()
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/dispatch.py", line 239, in moveStep
rc=apply(func, self.bindArgs(args))
File "/tmp/ksclass.py", line 1414, in RocksPreInstall
File "/tmp/ksclass.py", line 1409, in RocksReadComps
File "/tmp/updates/usr/lib/anaconda/packages.py", line 172, in readPackages
grpset=method.readComps(hdrlist)
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/installmethod.py", line 65, in readComps
return self.readCompsViaMethod(hdlist)
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/urlinstall.py", line 96, in readCompsViaMethod
return groupSetFromCompsFile(fname, hdlist)
File "/var/tmp/anaconda-10.1.1.63//usr/lib/anaconda/hdrlist.py", line 937, in groupSetFromCompsFile
raise FileCopyException
NameError: global name 'FileCopyException' is not defined
This issue was resolved using a special SL4.5 'comps' roll for Rocks 4.3, as outlined here.
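Note that the final NameError is a secondary failure: the installer tried to raise FileCopyException, but that name was not defined in hdrlist.py, so Python reported the missing name instead of the intended error. The underlying problem was most likely that the comps file could not be retrieved, which is consistent with the 'comps' roll fixing it. The sketch below reproduces the masking effect in isolation. It is not anaconda code; CopyFailed, fetch_comps_buggy, fetch_comps_fixed, and outcome are hypothetical names used only for illustration.

```python
class CopyFailed(Exception):
    """Hypothetical stand-in for the exception the installer meant to raise."""


def fetch_comps_buggy(path):
    """Simulate a failed copy, then raise a name that was never defined."""
    try:
        raise IOError("could not copy " + path)
    except IOError:
        # FileCopyException does not exist in this module, so evaluating
        # the name raises NameError and hides the real cause.
        raise FileCopyException


def fetch_comps_fixed(path):
    """Same failure, but reported through an exception that is defined."""
    try:
        raise IOError("could not copy " + path)
    except IOError as err:
        raise CopyFailed(str(err))


def outcome(func):
    """Return the name of the exception type that func raises."""
    try:
        func("comps.xml")
    except Exception as exc:
        return type(exc).__name__
    return None


print("buggy handler reports:", outcome(fetch_comps_buggy))
print("fixed handler reports:", outcome(fetch_comps_fixed))
```

When reading a traceback like the one above, the line that raises (hdrlist.py, groupSetFromCompsFile) is a better guide to the real problem than the NameError message itself.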
April 28, 2008
MB & MK -- Repartitioned head node, configured PERC6/E controller on head node to control big disk.
- Changed HN /usr partition to take up the entirety of the previous /usr and /scratch. The HN will host an NFS mount for CMSSW installations, so it requires additional space for all the versions of CMSSW that we wish to install. HN partitioning was later changed again; see later logs for the new partitions.
- Connected big disk to HN using the HN PERC6/E controller. Used RAID-6, which gives up two disks' worth of capacity to parity; in return there is a minimal performance hit when we lose one disk and no loss of data until we lose 3 disks. Mark recommends that we keep one spare disk on hand to swap in as soon as a disk fails. Formatted using a logical volume so that new arrays can be added and the volume extended without repartitioning or reformatting. Used XFS, which is well suited to large data arrays.
/dev/sdc 9744.8 GB, RAID-6, 15 physical disks, labeled bigdisk1, 256kB striping, read-ahead cache enabled (cache size TBD in filesystem), write-back cache enabled (as we have a battery backup on the controller itself)
/XXX XXXXXX /sdc1 LVM XFS
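The RAID-6 capacity trade-off can be worked through numerically. The sketch below assumes all 15 disks are the same size, that none is set aside as a hot spare, and that the logged 9744.8 GB is the usable capacity reported by the controller; the function names are ours. Under those assumptions the array keeps 13 of 15 disks' worth of space, which implies about 749.6 GB per disk.

```python
def raid6_usable(n_disks, disk_size_gb):
    """Usable capacity of a RAID-6 array: two disks' worth goes to parity."""
    if n_disks < 4:
        raise ValueError("RAID-6 needs at least 4 disks")
    return (n_disks - 2) * disk_size_gb


def implied_disk_size(usable_gb, n_disks):
    """Per-disk size implied by a reported usable RAID-6 capacity."""
    if n_disks < 4:
        raise ValueError("RAID-6 needs at least 4 disks")
    return usable_gb / (n_disks - 2)


n_disks = 15
usable_gb = 9744.8  # figure recorded in the log for /dev/sdc

per_disk = implied_disk_size(usable_gb, n_disks)
print("implied per-disk size (GB):", round(per_disk, 1))
print("fraction of raw capacity usable:", round((n_disks - 2) / n_disks, 3))
print("disk failures tolerated without data loss:", 2)
```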
April 10, 2008
MK -- Installed Scientific Linux 4.5 on all nodes
- Partitions were later redone and SL4.5 was installed via Rocks