Error log
This guide is 99% out of date as of 2015 and will be removed and replaced shortly. ADMINS of hepcms: please consult our private Google pages for documentation.
Errors are organized by the program that caused them:
- Physical Node
- RAID
- Rocks
- Condor
- Logical volume
- CMSSW
- CVMFS
- SRM
- OSG/RSV
- SiteDB/PhEDEx
- Hadoop
- /data or /sharesoft
Physical Node
- Node shows the following with omreport chassis:
R510-0-6: Critical : Voltages
In the omreport system alertlog, this shows:
R510-0-6: Severity : Critical
R510-0-6: ID : 1154
R510-0-6: Date and Time : Sun Mar 4 15:38:50 2012
R510-0-6: Category : Instrumentation Service
R510-0-6: Description : Voltage sensor detected a failure value
R510-0-6: Sensor location: System Board 5V PG
R510-0-6: Chassis location: Main System Chassis
R510-0-6: Previous state was: Unknown
R510-0-6: Discrete voltage state: Bad
This actually means the firmware needs to be updated; only after doing that should you contact Dell for support.
- Node has two green lights on the back for the PDU power supply, but pressing the power button doesn't power on the node, even if you hold it down for a while (tested on R510s)
- From memory of a Dell support call: remove all power to the node (unplug both power cords), then press and hold the power button for at least a count of 20 seconds (maybe 30). Then plug the node back in and press the power button. This helped one node after a thunderstorm shutdown.
RAID
- During HN boot:
Foreign configuration(s) found on adapter.
Followed by:
1 Virtual Drive(s) found
1 Virtual Drive(s) offline
3 Virtual Drive(s) handled by BIOS
This Dell troubleshooting guide is a useful resource. In our case, this occurred because we booted the HN before the disk array had fully powered up; we believe this also corrupted the PERC 6/E RAID controller configuration. After shutting the HN down again, letting the disk array power up fully, and then powering on the HN, we loaded the foreign configuration (pressed the F key). The RAID controller can also be reconfigured using the configuration utility (C or Ctrl+R).
Rocks
- General guide for Rocks commands: on the HN as root (su -), do the following (example for all nodes except the frontend):
ssh-agent $SHELL
ssh-add
rocks run host compute R510 interactive SE grid "ls /data" collate=yes
- You can also pass more involved commands, as in this particular example (sends to one node, uses sed):
rocks run host interactive-0-3 "sed 's/LOCAL_CONFIG_FILE/#LOCAL_CONFIG_FILE/' /etc/condor/condor_config > /etc/condor/condor_config_fix"
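If the rewritten file looks correct, a hypothetical follow-up (not part of the original example) would be to move it into place on that node and restart Condor there:
rocks run host interactive-0-3 "mv /etc/condor/condor_config_fix /etc/condor/condor_config"
rocks run host interactive-0-3 "service condor restart"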
- If, while installing Rocks on the frontend, you get an "unhandled exception" error with a screen full of debugging output and the kickstart does not finish, check the following:
- Check that the versions of SL and Rocks you are using are compatible with each other
- Make sure there are no previous versions of /export/rocks/install that may conflict with the new installation. Note that when it comes time to kickstart other nodes, assuming the frontend (head node) kickstarted correctly, you will want the xml files in their appropriate place in /export/rocks/install
- "An error occurred when attempting to load an installer interface component className=FDiskWindow"
Rocks is complaining that the partition table in the kickstart file is incorrect. Depending on your situation, you may need to force the default partitioning scheme to recover, probably losing all data on your disks.
- shoot-node gives errors:
Waiting for ssh server on [compute-0-1] to start
ssh: connect to host compute-0-1 port 2200: Connection refused
...
Waiting for VNC server on [compute-0-1] to start
Can't connect to VNC server after 2 minutes
ssh: connect to host compute-0-1 port 2200: Connection refused
...
main: unable to connect to host: Connection refused (111)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/scratch/home/build/rocks-release/rocks/src/roll/base/src/foundation-python/foundation-python.buildroot//opt/rocks/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/opt/rocks/sbin/shoot-node", line 313, in run
os.unlink(self.known_hosts)
OSError: [Errno 2] No such file or directory: '/tmp/.known_hosts_compute-0-1'
and examination of WNs reveals they are trying to install interactively (i.e., requesting language for the install, etc.):
This seems to occur most commonly when there is a problem with the Kickstart files used for the Rocks distribution. The solution that works most consistently is to remove all of your modified Kickstart files (leave skeleton.xml). Depending on the error, you may also have to force default partitioning, which will cause you to lose existing data on the nodes.
- shoot-node & cluster-kickstart give the error:
error reading information on service rocks-grub: No such file or directory
cannot reboot: /sbin/chkconfig failed: Illegal seek
This occurs when the rocks-boot-auto package is removed, which prevents WNs from automatically reinstalling every time they experience a hard boot (such as a power failure). This error can be safely ignored.
- Kickstart of a node shows on the screen:
Could not allocate requested partitions:
Partitioning failed: Could not allocate partitions as primary partitions.
Not enough space left to create partition for /.
Press 'OK' to reboot your system.
Although the Rocks partition guide suggests you can use the syntax "--ondisk sda", the RHEL "Kickstart Configurator" GUI uses the syntax "--ondisk=sda". We also added the lines:
--bytes-per-inode=4096 --fstype="ext3"
But this probably did not fix the problem; it is likely that the missing "=" caused the issue.
- Please also note that because Hadoop is installed, the standard kickstart files do not format disks (to preserve existing files, you usually do NOT want to re-format). Some working xml files that do format disks can be found on the HN as root.
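For reference, a minimal sketch of what the partition lines look like in Kickstart syntax (placeholder sizes and mount points, not our actual layout; note the "=" in --ondisk=):
part / --size=16000 --ondisk=sda --fstype="ext3" --bytes-per-inode=4096
part swap --size=2000 --ondisk=sda
part /scratch --size=1 --grow --ondisk=sda --fstype="ext3" --bytes-per-inode=4096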
Condor
- Condor job submission works from the HN, but from none of the WNs. Usually this is a permissions problem involving incorrect network and/or UID settings. Configure Condor to talk on the internal network only and be sure to set UID_DOMAIN to local on all nodes (see the configuration sketch after this list).
- Nothing comes up after condor_q. Check that the HN disk isn't full from large logs (df -h).
- Condor commands fail to work after a software update. Check that the configuration file(s) in /etc/condor/config.d/ were not overwritten. Consult the General Condor configuration guide for the proper settings and restore the file.
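A minimal sketch of the relevant settings (the file name and internal subnet below are placeholders, not our actual values):
# /etc/condor/config.d/99-local.conf (hypothetical file name)
UID_DOMAIN = local
# talk on the internal network only (10.1.*.* is a placeholder for the private subnet)
NETWORK_INTERFACE = 10.1.*.*
After changing the file, restart Condor on the node (/sbin/service condor restart).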
Logical Volume
- Insufficient free extents (2323359) in volume group data: 2323360 required (error is received on command lvcreate -L 9293440MB data).
Sometimes it is simpler to enter the value in extents (the smallest logical units LVM uses to manage volume space). Use a '-l' instead of '-L' and specify the maximum number of free extents (provided by the error):
lvcreate -l 2323359 data
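You can also query the free extents directly and let lvcreate take everything that is available (a sketch; the volume group name data and the volume name lvol0 are taken from the examples in this guide):
# show the free physical extents in the volume group
vgdisplay data | grep Free
# create a logical volume using all remaining free extents
lvcreate -l 100%FREE -n lvol0 data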
CMSSW
- cmsRun works on the HN, but on none of the WNs.
Note that this error could be caused by any number of issues; when we encountered it, it was because we had set VO_CMS_SW_DIR to the local node's directory rather than the network-mounted directory. We had to completely reinstall CMSSW.
- E: Sub-process /sharesoft/cmssw/slc4_ia32_gcc345/external/apt/0.5.15lorg3.2-CMS19c/bin/rpm-wrapper returned an error code (100)
This link suggests that it is due to a lack of disk space in the area where you are installing CMSSW. However, because we install in /sharesoft and /sharesoft is auto-network-mounted, the size of /sharesoft is not reported until it has been explicitly ls'ed or cd'ed, so when RPM checks that there is enough space in /sharesoft to install, the check fails. When executing apt-get, add the option:
apt-get -o RPM::Install-Options::="--ignoresize" ...
- error: unpacking of archive failed on file /share/apps/cmssw/share/scramdbv0: cpio: mkdir failed - Permission denied
This error occurs because both bootstrap.sh and the CMSSW apt-get install create a soft link to the 'root' directory where CMSSW is being installed. In our case, since we first tried to install CMSSW to /share/apps (automatically network-mounted by Rocks), the soft link was named share. However, CMSSW also has a true subdirectory named share and writes files to this directory. The soft link overrides the true directory, so CMSSW tries to install to /share, where it does not have permission. In short, CMSSW cannot be installed under any root directory named /share, /common, /bin, /tmp, /var, or /slcX_XXX.
- apt-get update issues the error:
E: Could not open lock file /var/state/apt/lists/lock - open (13 Permission denied)
E: Unable to lock the list directory
Be sure to first source the scram apt info:
source $VO_CMS_SW_DIR/$SCRAM_ARCH/external/apt/<apt-version>/etc/profile.d/init.csh
- The version of CMSSW the user wants is not found (see the example after this list):
- Be sure to check different SCRAM_ARCH following the User Guide
- CMSSW versions are automatically installed and removed as they are deprecated. If a version vanishes and a user requires it, you can install it by hand following the Admin Guide CMSSW install instructions.
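To see which releases are actually installed for a given architecture, a sketch (the source path is this cluster's; the SCRAM_ARCH value is a placeholder):
source /sharesoft/cmssw/cmsset_default.sh
export SCRAM_ARCH=slc5_amd64_gcc462   # placeholder architecture
scram list CMSSW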
CVMFS
CVMFS failures are evident when CMSSW, CRAB, and similar scripts do not work, and when the soft link /sharesoft/cmssw/cmsset_default.csh (or .sh) is broken; note the soft link can also be affected by problems with /data. We have also seen users get hangs when logging in (cvmfs is sourced in the shell script) or when running ls or any CMSSW or root commands. There are some debugging guides you may find useful at OSG, CERN, and UFL. The commands below, from the CERN guide, have worked for us.
- First check to see if the directory holding the CVMFS cache is full: R510 and compute: /tmp; grid & interactive: /scratch; HN & SE: no CVMFS
- If you need to unmount cvmfs, the proper way to unmount it is (as root su - on that node):
cvmfs_config umount
- I have also seen umount -l /cvmfs/cms.cern.ch
- Remount as a user with ls /cvmfs/cms.cern.ch/cmsset_default.csh
- Check that no instance of CVMFS is running (this should return only your grep command; if the mount is hanging you may see "mount" instances here):
ps -aux | grep cvmfs2
- Wipe the cache:
cvmfs_config wipecache
- Run a filesystem check (there may be errors, but I have not found them to be illustrative; note that you need /scratch instead of /tmp below for some nodes, as indicated above):
cvmfs_fsck -j 4 /tmp/cvmfs/shared
- Check the configuration is valid (it should return OK):
cvmfs_config chksetup
- Test by listing files mounted on CVMFS (it's an auto-mount system and will mount when you ask for it):
ls -alh /cvmfs/cms.cern.ch/cmsset_default.sh
- Pro tip: the above set of commands is saved in a script, /sharesoft/osg/scripts/fixCVMFS.sh (run it 3-4 times before giving up)
- The cvmfs system uses automount; if the above does not work, we have also been advised to run: cvmfs_config umount; service autofs reload; cvmfs_config wipecache; then mount as a user (see above)
- Please note that if cvmfs appears to work but users are getting strange errors about missing parts of CMSSW, root not working, or long hangs, you should check that SQUID is working (there is a link on the main page to see squid results). One telltale sign is that cvmfs_config chksetup reports errors connecting to the squid proxy, e.g.:
- Warning: failed to access http://cvmfs.fnal.gov:8000/opt/cms/.cvmfspublished through proxy http://hepcms-hn.umd.edu:3128
- The solution is to check and then start SQUID on the head node. Check the OSG guide for the latest commands, but it is most likely: service frontier-squid status; then service frontier-squid start if it is not running (or service frontier-squid restart if it is)
- Also check that squid has sufficient running space on the head node (df -h /scratch), and check for excessive network activity on the HN and the node with cvmfs troubles
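Putting the steps above together, a sketch of the recovery sequence (assembled from the commands listed above; the installed fixCVMFS.sh may differ, and nodes that use /scratch instead of /tmp need the cache path adjusted):
# run as root on the affected node; repeat 3-4 times before giving up
cvmfs_config umount
cvmfs_config wipecache
cvmfs_fsck -j 4 /tmp/cvmfs/shared              # /scratch/cvmfs/shared on grid & interactive nodes
cvmfs_config chksetup                          # should report OK
ls -alh /cvmfs/cms.cern.ch/cmsset_default.sh   # triggers the automount as a test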
SRM
- srmcp issues the error:
GridftpClient: Was not able to send checksum
value:org.globus.ftp.exception.ServerException: Server refused
performing the request. Custom message: (error code 1) [Nested
exception message: Custom message: Unexpected reply: 500 Invalid
command.] [Nested exception is
org.globus.ftp.exception.UnexpectedReplyCodeException: Custom
message: Unexpected reply: 500 Invalid command.]
but the file transfer is successful.
This error occurs because srmcp is an srm client developed by dCache with the special added functionality of a checksum. BeStMan uses the LBNL srm client and does not support srmcp checksum functionality, nor does Globus gridftp. This error can be safely ignored.
OSG/RSV
- OSG 1.2 error (not necessarily OSG 3): The RSV cacert-crl-expiry-probe fails with an error to the effect of:
/sharesoft/osg/ce/globus/TRUSTED_CA/1d879c6c.r0 has expired! (nextUpdate=Aug 15 14:28:32 2008 GMT)
This can occur because, for one reason or another, the last cron jobs which should have renewed the certificates did not execute or complete for that particular CA in time. You can manually run the cron jobs by first searching for them in cron:
crontab -l | grep cert
crontab -l | grep crl
then execute them:
/sharesoft/osg/ce/vdt/sbin/vdt-update-certs-wrapper --vdt-install /sharesoft/osg/ce
/sharesoft/osg/ce/fetch-crl/share/doc/fetch-crl-2.6.2/fetch-crl.cron
If fetch-crl.cron prints errors about "download no data from... persistent errors.... could not download any CRL from...", ignore them as long as voms-proxy-init works once fetch-crl.cron completes.
- OSG 1.2 error (not necessarily OSG 3): MyOSG GIP tests give the following error when using a grid-mapfile on the OSG CE:
GLUE Entity GlueSEAccessProtocolLocalID does not exist
CEMon gets information for BDII by issuing various srm commands using your http host cert. The distinguished name (DN) of your http host cert needs to be added to your grid-mapfile-local and mapped to a user account.
- OSG services not working (some or all) after a software update: Check the screenlog where you did the update and look for a warning like the following example:
warning: /etc/globus/globus-condor.conf created as /etc/globus/globus-condor.conf.rpmnew
This means that your configuration file was changed. Make sure you did not lose your original settings by comparing the two files and, where needed, consulting the OSG configuration page. Be sure to re-run the osg-configure commands (see the OSG Release 3 twiki for those commands) and restart the affected OSG services.
- If all appears to be well (OSG services running) but all RSV tests fail (GRAM Authentication test failure: authentication with the remote server failed, etc.), check to make sure GUMS is running. If necessary, use the instructions below to Stop OSG, Restart GUMS, and Start OSG.
- Check OSG service status with the following commands (see the loop sketch after this list):
/sbin/service rsv status
/sbin/service condor-cron status
/sbin/service httpd status
/sbin/service bestman2 status
/sbin/service globus-gridftp-server status
/sbin/service gratia-probes-cron status
/sbin/service tomcat5 status
/sbin/service globus-gatekeeper status
/sbin/service gums-client-cron status
/sbin/service condor status
/sbin/service fetch-crl3-cron status
/sbin/service fetch-crl3-boot status
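For convenience, the same checks can be run in one loop (a sketch; the service names are exactly those listed above):
for s in rsv condor-cron httpd bestman2 globus-gridftp-server gratia-probes-cron \
         tomcat5 globus-gatekeeper gums-client-cron condor fetch-crl3-cron fetch-crl3-boot; do
    /sbin/service $s status
done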
Stop OSG
OSG services are supposed to start at boot time, but because they depend on so many other services, they sometimes come up in an unstable state (this and the GUMS restart did not seem necessary in OSG 3 at our last reboot). They also need to be started after GUMS comes up on SE-0-2. Best practice is to manually restart OSG 3 services after a GN reboot. As root (su -) on the GN:
/sbin/service rsv stop
/sbin/service condor-cron stop
/sbin/service httpd stop
/sbin/service bestman2 stop
/sbin/service globus-gridftp-server stop
/sbin/service gratia-probes-cron stop
/sbin/service tomcat5 stop
/sbin/service globus-gatekeeper stop
/sbin/service gums-client-cron stop
/sbin/service condor stop
/sbin/service fetch-crl3-cron stop
/sbin/service fetch-crl3-boot stop
Restart GUMS
GUMS is running on SE-0-2 and handles authentication for OSG services. Here is a link to the OSG GUMS troubleshooting guide. If needed, as root on SE-0-2 (su -), stop and then start services:
/sbin/service tomcat5 stop
/sbin/service mysqld stop
/sbin/service fetch-crl3-boot stop
/sbin/service fetch-crl3-cron stop
/usr/sbin/fetch-crl3
/sbin/service fetch-crl3-boot start
/sbin/service fetch-crl3-cron start
/sbin/service mysqld start
/sbin/service tomcat5 start
You may check that GUMS is running by authenticating with your GUMS administrator GRID cert at https://hepcms-1.umd.edu:8443/gums/.
Start OSG
This is assuming you have stopped OSG3 above, and then restarted GUMS. As root (su -) on the GN:
/sbin/service fetch-crl-boot start
/sbin/service fetch-crl-cron start
/sbin/service condor start
/sbin/service gums-client-cron start
/sbin/service globus-gatekeeper start
/sbin/service tomcat5 start
/sbin/service gratia-probes-cron start
/sbin/service globus-gridftp-server start
/sbin/service bestman2 start
/sbin/service httpd start
/sbin/service condor-cron start
/sbin/service rsv start
SiteDB/PhEDEx
- After attempting to log in to PhEDEx via certificate, a window pops up several times requesting your grid cert (already imported into your browser) and after multiple OK's, eventually goes to a page with the message:
Have You Signed Up?
You need to sign up with CMS Web Services in order to log in and use privileged features. Signing up can be done via SiteDB.
If you have already signed up with SiteDB, it is possible that your certificate or password information is out of date there. In that case go back to SiteDB and update your information.
For your information, the DN your browser presents is:
/DC=something/DC=something/OU=something/CN=Your Name ID#
This problem occurs when your SiteDB/hypernews account is not linked with your grid certificate. Go to the SiteDB::Person Directory (SiteDB only works in the Firefox browser), log in with your hypernews account, and follow the link under the title labeled "Edit your own details here". In the form entry box titled "Distinguished Name", enter the DN info displayed earlier and click on the "Edit these details" button. You should then be able to log in to PhEDEx with your grid certificate in 30-60 minutes.
Hadoop
- On the compute and R510 nodes, the Fuse mount of /hadoop is lost, and it returns:
ls: /hadoop: No such file or directory
Unmount and remount /hadoop using the following (this can also be done with rocks; see the sketch below):
umount -nf /hadoop
mount /hadoop
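To remount on all the worker nodes at once from the HN, a sketch using the rocks run host syntax shown elsewhere in this guide (the node categories are this cluster's; adjust as needed):
rocks run host compute R510 command="umount -nf /hadoop; mount /hadoop" collate=yes x11=no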
If that fails, wait 5-10 minutes, and try those commands again.
- Hadoop health can be checked by using firefox on a node inside the cluster (interactive) and browsing to se-0-2:50070/dfshealth.jsp
- If a node is reporting as dead, first make sure hadoop is running:
service hadoop status
- Restart if needed and check status:
service hadoop restart; service hadoop status
- Check to be sure the disk on that particular node is not full (df -h)
- If this still does not work, check /scratch/hadoop/log/hadoop-hdfs-datanode-nodename.local.log for what error is causing hadoop to fail
- More helpful commands (to execute on the SE) can be found at the HadoopOperations twiki
- Occasionally (seen after a long shutdown), some of the individual hadoop data disks come back online with the wrong ownership, and hadoop then does not run on that particular datanode. If restarting the hadoop services does not fix it automatically, you will have to change ownership to hdfs:users by hand (e.g., chown -R hdfs:users /hadoop9). This is NOT true of the FUSE /hadoop mount.
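A quick way to check the ownership of every data directory on a datanode (a sketch; the /hadoopX/data layout is the one listed in /etc/sysconfig/hadoop, described in the next bullet):
# list the owner of each hadoop data directory on this datanode
for d in /hadoop[0-9]*/data; do ls -ld "$d"; done
# fix any that are wrong, e.g.: chown -R hdfs:users /hadoop9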
- If for some reason you need to disable a disk on a hadoop datanode, remove that disk from /etc/sysconfig/hadoop (the line that lists all the /hadoopX/data disks). Make that take effect with service hadoop-firstboot start, then restart hadoop with service hadoop restart. Check the log in /scratch/hadoop/log to be sure the changes took effect (a command sketch follows).
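A sketch of that sequence on the datanode (check the exact variable holding the disk list in /etc/sysconfig/hadoop on the node itself; the log file name follows the pattern given above):
# as root (su -) on the affected datanode
vi /etc/sysconfig/hadoop            # remove the failed /hadoopX/data entry from the data-disk list
service hadoop-firstboot start      # regenerate the hadoop configuration with the new disk list
service hadoop restart              # restart hadoop on the datanode
less /scratch/hadoop/log/hadoop-hdfs-datanode-nodename.local.log   # confirm the change took effect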
- Balancing Hadoop: useful when individual datanode disks reach or come close to 100% full. Note that hadoop has to be running on all datanodes for them to balance. Issue the command from the SE (su -); it may take some time to run and stress the network. Use only one of the two following commands:
/usr/lib/hadoop/bin/start-balancer.sh -threshold 5
hadoop balancer -threshold 5
- If a single datanode disk on a particular node is full, hadoop will not start at all on that disk. If you want to re-start that node, you will have to delete some blocks by hand and then run Hadoop balancing afterwards. This is a very dangerous operation and can result in data loss if not done correctly; do not proceed with these instructions lightly. I have only attempted them successfully once; they were learned from the osg-hadoop@opensciencegrid.org mailing list. A compressed command sketch follows this list.
- First be sure that there are no missing blocks (i.e., the data from that node has been recovered elsewhere; you may need to wait a day for hadoop to self-correct). Use the fsck command at the HadoopOperations twiki to check. If there are missing blocks, you must NOT proceed.
- Then exclude this datanode (use its internal IP): edit /etc/hadoop/conf/hosts-exclude on the SE
- hadoop dfsadmin -refreshNodes
- Be sure hadoop is not running on the datanode: service hadoop stop
- Run the fsck commands again to be sure the datanode is fully decommissioned (this can take up to 2 hours to remove it from the hadoop nodes properly) and the filesystem is healthy
- Very Dangerous: Remove a couple of blocks by hand on the decommissioned node (this can also be used to replace datanode disks, instead moving the blocks to the new disk)
- Edit /etc/hadoop/conf/hosts_exclude to remove the now repaired datanode
- Start hadoop on datanode: service hadoop start
- Refresh nodes on SE: hadoop dfsadmin -refreshNodes
- Run the hadoop balancer as above and the filesystem checks
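A compressed sketch of the sequence above (the exclude-file name is written both as hosts-exclude and hosts_exclude in this guide, so check which exists on the SE; the fsck invocation shown is the generic hadoop fsck, see the HadoopOperations twiki for the site-specific form):
hadoop fsck /                        # on the SE: confirm there are NO missing blocks before proceeding
# add the datanode's internal IP to the exclude file in /etc/hadoop/conf, then:
hadoop dfsadmin -refreshNodes        # on the SE
service hadoop stop                  # on the datanode
# ... remove blocks by hand / replace the disk (the dangerous step described above) ...
# remove the node from the exclude file again, then:
hadoop dfsadmin -refreshNodes        # on the SE
service hadoop start                 # on the datanode
/usr/lib/hadoop/bin/start-balancer.sh -threshold 5   # rebalance from the SE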
- Scratch cleaning: on R510s and compute nodes, the /scratch directory can become full with logs. To clean them, log into the node and become superuser (su -), then run the script: python /sharesoft/osg/scripts/pyCleanupHadoopLogs.py
- If you want to run the script in a rocks run host command, it has to be given the hostname explicitly, since it does not read it properly otherwise. Instead run the following (set to remove all but the last 15 days of files, and any recent ones above 2GB):
rocks run host compute R510 command="/sharesoft/osg/scripts/cleanHadoop.sh" collate=yes x11=no
- Datanode disk repair: Sometimes individual disks have problems, which can be found with "smartd" errors in /var/log/messages. Upon reboot (especially in the R510s), fsck will be run and can fail, which means you have to work with that machine directly from the console in the cluster room. Things to try (look up on another R510 which /dev/sd* corresponds to your particular /hadoopX disk):
- After entering the root password on the broken machine, go into filesystem recovery mode and run fsck -y /dev/sde (example for /hadoop5)
- Failing that, you may need to reformat an individual disk; this is easier with the node accessible remotely. To do so, you need to edit /etc/fstab and remove the line for that disk (for instance, remove the line with /hadoop5). First, mount the filesystem as read-write (in filesystem recovery mode):
mount -o remount,rw /
nano /etc/fstab
- To reformat a disk, first make sure that hadoop is not missing too many blocks before proceeding (check with firefox on a node inside the cluster (interactive), browsing to se-0-2:50070/dfshealth.jsp). Proceed with caution, as you can destroy data; this only works on a /hadoopX directory under the assumption that the blocks you are deleting have been healthily replicated elsewhere.
As root (su -) on the machine that has the failed disk:
- First, delete the partition and make a new one with fdisk (example for /hadoop5 in /dev/sde):
fdisk /dev/sde
- Type p to list partitions
- Type d to delete partition, select partition 1
- Type n to make a new partition, make partition 1, then select the defaults for the rest of the options
- Type w to write the partition table
- Then, make the filesystem (this command will take some time):
mkfs.ext3 /dev/sde1
- Label the filesystem:
e2label /dev/sde1 /hadoop5
- Add a line to /etc/fstab if needed:
LABEL=/hadoop5 /hadoop5 ext3 defaults 1 2
- Mount the filesystem:
mount /hadoop5
- Restart hadoop:
/sbin/service hadoop restart
/data
- ls /data shows nothing (the expected result is groups se users): unmount and remount as root (su -) on the node where it failed. Also check to be sure that other nodes have mounted /data (if the disk is not mounted properly on the GN, this will fail; see below):
umount -nf /data; mount /data; ls /data
- Input/Output error: /data/users: Input/output error, or umount -nf /data; mount /data gives: mount: /dev/mapper/data-lvol0: can't read superblock
- This should be a rare occurrence; first check the health of the GN (df -h), and then the health of the big disk RAID:
omreport chassis
omreport storage pdisk controller=1
- You will also see the following type of errors in /var/log/kern for the last ~day:
hepcms-0 kernel: I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0x9650 ("xfs_trans_read_buf") error 5 buf count 8192
hepcms-0 kernel: scsi 1:2:0:0: rejecting I/O to dead device
- Note that the disk won't go down until it gives this error (sometimes half a day to a full day after the first occurrence of the error):
hepcms-0 kernel: Filesystem dm-0: I/O Error Detected. Shutting down filesystem: dm-0
- Then reboot the GN, following the Recovery procedure. Be sure to unmount and remount /data on all other nodes; it is not optional in this case.
- Red light on the front of one of the disks in the physical /data cabinet: this usually means the individual disk in the RAID array has failed. If a reboot does not repair it, obtain a replacement. Useful commands to check which disk is unhealthy and how (on the GN as su -):
omreport storage pdisk controller=1
- Make a specific disk flash:
omconfig storage pdisk action=remove controller=1 pdisk=0:0:5
- Consult the appropriate Dell manual for this particular RAID system, the Dell PowerVault MD1000, for disk replacement. Current web search link for this manual.
- Problem: After an unplanned storm-induced power outage, the GN and other nodes were not able to mount /data. /data on the GN was mounted but very small (backup-script.sh produced /data/users/root/ce.tar but without the RAID array being visible, it produced a new filesystem that is NOT /data).
Note that this is after all other fixes above failed, and these steps follow explicit instructions from Dell Support. Following them without care can DAMAGE /data.
- Check to see if the RAID array is indeed not visible by browsing: https://hepcms-0.umd.edu:1311/servlet/OMSAStart?mode=omsa&vid=106240111612353
Note that the "Storage" component is in yellow warning status; looking at the details, all disks are in the "Foreign" state.
- You might also be able to verify the above with this command on the GN:
omreport storage pdisk controller=1
- To fix (as instructed by Dell support): in the web browser, go back up to the PERC 6/E Adapter (PCI Slot 2) tab and choose "Information/Configuration" near the top of the page. Under "Available Tasks", choose the option that deals with "Foreign Configuration"; DO NOT choose "Reset Configuration". For the "Foreign Configuration" task, press "Execute". This brings up a new page; then perform the import of the foreign configuration.
- You should see all disks change to Online, and the state is now "Ready"
- A reboot of the GN and a remount of /data on the nodes may be required after this is fixed.