September 2009 Log
Kernel updates, debugged PhEDEx hanging, installed new CRAB releases.
September 30, 2009
MK -- Installed CRAB_2_6_3 and linked as current
- Haven't established if it works at UMD. Also did not modify interactive.xml to install CRAB_2_6_3.
September 24, 2009
MK -- Kernel updates
- Reboot scheduled.
- compute-0-5 had the correct modifications for the new Kernel inside /boot/grub/grub-orig.conf; none of the other nodes did. On compute-0-5, /boot/grub/grub.conf is a symlink to the absolute path /boot/grub/grub-orig.conf, not to the relative path grub-orig.conf. So perhaps the utility which modifies the file doesn't resolve a relative symlink target properly. Modified all /boot/grub/grub.conf files to be a symlink to the fully pathed /boot/grub/grub-orig.conf. Tested on compute-0-3 (removed the rpm, called yum update again) and it worked! Now /boot/grub/grub-orig.conf is being modified to boot to the *new* Kernel. Also modified the Kickstart file accordingly.
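For reference, the symlink change above could be sketched as a small shell function. This is only an illustration: the function name fix_grub_symlink is made up here, and it assumes the only thing wrong is the link target (it does not touch the contents of grub-orig.conf).

```shell
# Hypothetical helper: repoint grub.conf at the fully pathed grub-orig.conf.
# The directory defaults to /boot/grub, as on the nodes described above.
fix_grub_symlink() {
    grub_dir="${1:-/boot/grub}"
    link="$grub_dir/grub.conf"
    target="$grub_dir/grub-orig.conf"

    # Refuse to act unless the layout matches what this log describes.
    [ -L "$link" ] || { echo "$link is not a symlink; leaving it alone" >&2; return 1; }
    [ -f "$target" ] || { echo "$target does not exist" >&2; return 1; }

    current="$(readlink "$link")"
    if [ "$current" = "$target" ]; then
        echo "$link already points to $target"
    else
        # -f replaces the existing link, -n avoids following it first.
        ln -sfn "$target" "$link"
        echo "$link: was '$current', now '$target'"
    fi
}

# Example (commented out so sourcing this file changes nothing):
#   fix_grub_symlink /boot/grub
```

The function only defines behavior; nothing runs until it is called as root on a node.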
September 19, 2009
MK -- Rebooted GN, INs, HN to new Kernel
- The first time I rebooted all the nodes (including the compute nodes previously), I had them reboot to the non-SMP kernel files. An SMP (symmetric multiprocessing) kernel is needed to use multiple CPUs, so I edited grub-orig.conf to load the smp files, not the default files. Detected this problem because Ganglia showed too few available CPUs and the incorrect number of CPUs on all nodes. Also found it via condor_status, which showed only 8 available slots in total (1 per worker node, instead of 8 per worker node).
- Had a Kernel panic on the HN because I told it to boot from partition / instead of partition /1 (silly Rocks). For future reference, to recover from Kernel panics without reinstall:
- Insert SL4.5 install disk 1
- Reboot machine
- At SL4.5 boot menu, type:
linux rescue
- Follow the prompts. Generally you don't need to start the network interfaces. It's best to mount the Linux system using "Continue" not "Read-Only", since you'll probably want to make some changes.
- At the linux prompt, type:
chroot /mnt/sysimage
- Make whatever changes are needed. Most Kernel panics are caused by a problem in /boot/grub/grub.conf, which in Rocks is a symlink to /boot/grub/grub-orig.conf.
- Type exit twice and remove the CD from the drive so the OS can boot normally.
- XFS kernel module on grid node was for the wrong kernel. Needed to update that as well:
rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/kernel-module-xfs-2.6.9-89.0.9.ELsmp-0.4-1.x86_64.rpm"
rpm -ev xfsprogs-2.6.13-3.2.el4.rf
rpm -ivh "http://ftp.scientificlinux.org/linux/scientific/45/x86_64/contrib/RPMS/xfs/xfsprogs-2.9.4-1.x86_64.rpm"
mount /data
All nodes need to remount /data. As root on the HN:
ssh-agent $SHELL
ssh-add
cluster-fork "mount /data"
- After reboot, all nodes must start OMSA (/home/install/sbin/OMSAinstall.sh). It looks like configure doesn't have to be called, but I did just in case. Reminder: OMSA storage monitoring on the grid node is not configured!
- ipmi service is failing to start on all nodes, preventing the use of ipmish to query status. Will have to use Dell OMSA commands instead.
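The checks scattered through this entry (SMP kernel, CPU count, XFS module matching the running kernel) could be gathered into a few shell helpers to run after a reboot. This is only a sketch: the function names are made up here, and it assumes SL4-style naming, where SMP kernel releases end in "smp" and the XFS module package is named kernel-module-xfs-<kernel release>, as in the rpm URL above.

```shell
# Hypothetical post-reboot helpers; names are illustrative, not from any Rocks tool.

# Succeeds if the given kernel release string looks like an SMP build.
# Assumes SL4 naming, e.g. 2.6.9-89.0.9.ELsmp.
is_smp_kernel() {
    case "$1" in
        *smp) return 0 ;;
        *)    return 1 ;;
    esac
}

# Counts the logical CPUs listed in a cpuinfo file (default: /proc/cpuinfo).
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Prints the XFS module package name expected for a kernel release.
xfs_module_for() {
    echo "kernel-module-xfs-$1"
}

# Example use on a node (commented out so sourcing this file changes nothing):
#   is_smp_kernel "$(uname -r)" || echo "WARNING: not running an SMP kernel"
#   count_cpus
#   rpm -q "$(xfs_module_for "$(uname -r)")"
```

From the HN, something like cluster-fork "uname -r" should show which kernel each compute node actually booted.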
September 18, 2009
MK -- Updated Squid
- Now running Squid for Frontier version 4.0rc9.
September 17, 2009
MK -- PhEDEx hung, Kernel updates
- After waiting quite some time and after some machine reboots, PhEDEx did hang again, reproducing the same problems encountered earlier. Ricky suggested it might be a PhEDEx bug: the timeout mechanism for utility jobs fails to kill stuck processes. The result for FileDownload, after 10 completely stuck utility jobs, is no agent activity besides expirations, because no post validation can occur to finish transfers. Will be fixed in the next PhEDEx release. Not clear what 'utility' jobs are exactly.
- In the meantime, I've suspended LoadTest subscriptions and restarted agents.
- Discovered that system was rebooting to old Kernel, had to modify grub-orig.conf to get it to boot to the new Kernel. interactive, grid, and headnode await reboot - all ready with new grub-orig.conf file.
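To see which kernel a node will boot before rebooting it, the default entry in the GRUB config can be printed. A minimal sketch, assuming a GRUB legacy config where the default= line appears before the title entries (entries are counted from 0); the function name default_kernel_line is made up here.

```shell
# Hypothetical helper: print the "kernel" line of the default boot entry
# in a GRUB legacy config file (default path as used on these nodes).
default_kernel_line() {
    awk '
        BEGIN { entry = -1; want = 0 }   # GRUB uses entry 0 if no default is set
        /^[ \t]*default[= \t]/ {
            sub(/^[ \t]*default[= \t]+/, "")
            want = $0 + 0
            next
        }
        /^[ \t]*title[ \t]/ { entry++; next }
        /^[ \t]*kernel[ \t]/ && entry == want {
            sub(/^[ \t]+/, "")           # drop leading indentation
            print
            exit
        }
    ' "${1:-/boot/grub/grub-orig.conf}"
}

# Example (commented out so sourcing this file changes nothing):
#   default_kernel_line /boot/grub/grub-orig.conf
```

If the printed line names the old kernel or the non-smp file, the config still needs editing before the reboot.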
September 13, 2009
MK -- Kernel updates
- Used yum update (had to remove tomcat-connectors first, then reinstall it afterwards) to update the kernel on all nodes. Rebooted all nodes to get to the new Kernel.
September 11, 2009
MK -- Testing PhEDEx LoadTests
- Increased LoadTest rate from FNAL & UNL to 5 MB/sec, unsuspended subscriptions, dropped -batch-files to 1 and -jobs to 1. Goal is to reproduce the PhEDEx hang which occurred with these settings previously. That time, under these conditions, in combination with a new PhEDEx release and the use of srm-copy instead of srmcp, PhEDEx hung completely in both the Debug and Prod agents. Not sure what the cause of the problem was, but had to suspend subscriptions to get Prod downloads to continue. Now attempting to duplicate the problem.
September 10, 2009
MK -- Installed CRAB_2_6_1 & 2_6_2
- CRAB_2_6_2 gave a Python error, which makes me suspect that it's calling some new gLite-UI utilities. For now, /scratch/crab/current is a link to 2_6_1, which appears to be what's needed for submitting to CrabServer.
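Switching the "current" link between releases can be sketched as below. The layout is an assumption: it supposes each release is unpacked in its own directory (e.g. CRAB_2_6_1) next to the current link under /scratch/crab, and the function name link_crab_current is made up here.

```shell
# Hypothetical helper: point <base>/current at the given release directory.
link_crab_current() {
    base="$1"
    release="$2"

    # Don't create a dangling link if the release isn't installed.
    [ -d "$base/$release" ] || { echo "$base/$release not found" >&2; return 1; }

    # -n matters here: without it, ln would follow the existing "current"
    # symlink and create the new link inside the old release directory.
    ln -sfn "$base/$release" "$base/current"
}

# Example (commented out so sourcing this file changes nothing):
#   link_crab_current /scratch/crab CRAB_2_6_1
```

Rolling back to an earlier release is then the same call with the older directory name.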