Notes, To Do, & Sandbox
If you arrived at this page by Google search, odds are this page won't help you; it's mostly 'notes to self'. Try the admin guide instead. Also, this page is deprecated, as the To Do list was moved offsite so that it is available to sysadmins only.
Notes
The next shoot-node/cluster-kickstart will:
(last updated November 4, 2011)
- For all nodes:
- Install hadoop-0.20-osg instead of hadoop (0.19). All WNs had been Kickstarted with the hadoop 0.19 Kickstart configuration, but none have been Kickstarted with the new hadoop 0.20 Kickstart changes (except R510-0-17). The install was tested manually from the command line, the correct RPMs are in the contrib directory, and a new Rocks distro has been made and is ready to be served. However, the new syntax for hadoop-0.20-osg in the Kickstart hasn't been tested. The changes between 0.19 and 0.20 are in the log for Aug 26.
- Install the FUSE kernel module (fuse-kmdl-2.6.18-274.3.1.el5-2.7.4-8_12.el5.x86_64) - make sure it matches the currently installed kernel
- Soft link /store to /hadoop/store (currently done by hand; see the command sketch after this list)
- For INs, SE, and GN (not Kickstarted since the last change to the WN xml or to appliance-specific XML):
- Call yum -y update
- Use new /etc/fstab mounting options for /data.
- For INs only (not Kickstarted since the last change to the WN xml or the IN-specific xml):
- Set PER_JOB_HISTORY_DIR in condor_config.local
- Install SL5 compatibility libs
- Install gLite-UI 3.2
- Install zlib-devel
- Install CRAB_2_7_4_patch1 (only)
- Install perl-libwww-perl.noarch
- Install tkinter
- Remove the lcg-CA.repo file from /etc/yum.repos.d after installing lcg-CA so that calling yum update won't grab all the CA crap again.
- Install and configure Hadoop & FUSE. Especially of note here is that I don't know if /bin/hostname works mid-Kickstart. The IN Kickstart file is configured to specifically have IN-0-0 turn the Hadoop services on after machine boot. I don't know if this particular code works! After Kickstarting IN-0-0, make sure the Hadoop service is running on the node. Easiest way is to turn it off - it will say there is nothing to turn off if it wasn't running. It's OK if the secondary namenode (IN-0-0) goes down for brief periods.
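A rough sketch of the equivalent manual commands for a few of the items above (hadoop-0.20-osg, the FUSE kernel module, the /store link, and the lcg-CA repo cleanup). Package names are taken from these notes; double-check them against the current distro before putting them in the Kickstart:
yum -y install hadoop-0.20-osg
# the kmdl package name embeds the kernel version, so this should pick the one matching the running kernel
yum -y install fuse-kmdl-$(uname -r)
ln -s /hadoop/store /store
# install the CAs, then drop the repo file so a later yum update won't pull them all again
yum -y install lcg-CA
rm -f /etc/yum.repos.d/lcg-CA.repo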
To Do:
(last updated September 30, 2013)
- Kickstart modification:
- Add rpms for gridftp-hdfs install to grid node RPMS area, change xml file
- Do a rocks create distro afterwards and verify health of xml
- Jeff's cron job commands into kickstart for HN
- TCP tuning into kickstart for compute and R510 nodes (see the sysctl sketch after this sub-list)
- TCP tuning into kickstart for GN
- Put commands for srvadmin-storage in extend-compute.xml, update web page version of that
- Change kickstart instructions for IN nodes to run yum install libX11-devel and yum install libXt-devel
- Update website version of kickstarts
- Channel bonding for R510s to kickstart (may not be possible?)
- Resolve /hadoopX directory permissions issues that appear after kickstarting (see R510-0-17)
- Other networking modifications to improve LAN-Hadoop network and FUSE mount uptime (compute; R510; grid; SE nodes)
- Update gLite-UI (to 3.2.10-0) and CRAB (to CRAB_2_7_9 or the latest) once they work together
- Put the latest tarballs of whichever CRAB and gLite-UI versions are used into the proper area, add them to interactive.xml, and update interactive.xml on the Admin Guide
- Update OSG to the latest version (will require changes to config.ini and debugging with RSV probes); the latest versions use a completely different RPM installation, so this may require a 1-2 day downtime and debugging - check that gratia-gridftp-transfer is enabled, which may need to be set in config.ini
- Update config.ini with hepspec
- Document method on website and archive old OSG pages, as I expect it will change
- Will require adding techniques to kickstart
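For the TCP tuning items above, a minimal sketch of what could go into the kickstart post sections; the sysctl keys are the usual ones for bulk-transfer tuning, but the values here are generic starting points, not our measured settings:
cat >> /etc/sysctl.conf << 'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 30000
EOF
sysctl -p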
- Website modification:
- Change the web User Guide to use the latest dataset and an appropriate CMSSW version in CRAB (pick up dataset=/RelValProdTTbar/SAM-MC_42_V12_SAM-v1/GEN-SIM-RECO and use it with CMSSW_4_2_x)
- Add a brief section to the admin guide about getting a dataset using PhEDEx to pass SAM tests
- Update xml and config.ini type files in Admin Guide pages (to go with the 5.4 upgrade)
- Update krb5.conf posted on website to go with CERN kdc changes
- Consider updating GUMS guide for proper SE account setup
- Put the 'no condor mailing' setting in the default configuration on the web page (Condor mail can fill up /var/log on the HN); a config sketch follows this sub-list
- Document channel bonding for R510s on Guide, note Rocks bug
- Web log for gridftp-hdfs and other debugging things (see Oct 2011 notes, may be done)
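For the 'no condor mailing' item above, one possible fix (assuming the installed Condor is new enough to have this knob) is to add the following to the condor_config.local posted on the web page, so jobs don't send completion mail by default:
JOB_DEFAULT_NOTIFICATION = NEVER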
- Hadoop settings debugging:
- Hadoop log4j.properties debug (somehow it's not working and /scratch can fill up on compute and R510 nodes)
- If resolved, implement in kickstarts, and all datanodes (compute R510)
- Settings to not fill datanode /hadoopX disks
- Hadoop load balancing (many compute nodes have 100% full /hadoopX disks, but this is not currently causing trouble); see the sketch below
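For the datanode disk items above, two things worth trying; the property and command are standard Hadoop 0.20, but the values are placeholders. In hdfs-site.xml, reserve some space on each /hadoopX volume so HDFS can't fill it completely:
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>21474836480</value> <!-- ~20 GB per volume, in bytes; adjust to taste -->
</property>
and run the balancer by hand (or from cron) to even out full datanodes:
hadoop balancer -threshold 10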
- New CMSSW install method with cron jobs
- Update firmware on R510 (except for 2 and 17) and SE (failed last downtime) during next downtime. See tricks on Admin Hardware page.
- See if OpenIPMI can be updated on HN. Reinstall. Or potentially just uninstall.
- Add logrotate scripts to the GN. Done already? If not, really should be done.
- Have PhEDEx perform auto-proxy renewal using ProxyRenew script inside PHEDEX/Custom/Template
- Tell BMCs to stop issuing DHCP requests that interfere with insert-ethers and shoot-node.
- Fix text of email alert to direct to hepcms-hn, not hepcms-0. Done already?
- Look into the Condor configuration setting behind $_CONDOR_SCRATCH_DIR (or something like it), which will start Condor jobs in /tmp; a sketch is at the end of this list. Right now this setting points into /var, which could fill up our log directories on WNs very quickly. Condor is very good about cleaning up after itself when it is done. If this gets set in Condor, then glideIn jobs could go back to using this environment variable instead of the OSG_WN_TMP environment variable. We asked the glideIn factory folks (osg-gfactory-support@NOSPAM.physics.ucsd.edu, remove "NOSPAM.") to change to using OSG_WN_TMP, but it would be better for us to just set this Condor setting ourselves.
- Look into Condor configuration for how users write job input/output to /tmp and transfer at the end of job
- Look into Condor memory limits (they were supposed to be implemented, but they don't seem to be working properly?)
- PhEDEx:
- Have a cron job which makes the PhEDEx tarball every week (a sketch follows this sub-list).
- Add info on deleting data, in particular, that the directories must be manually cleaned up after a PhEDEx deletion request completes.
- Currently using find to clean /home/uscms01. However, best solution is to follow this: https://twiki.grid.iu.edu/bin/view/Sandbox/MaradonnaWorkerNodes so that gLite jobs go directly to the WN /tmp instead of /home/uscms01.
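A sketch of the two stopgaps above as root crontab entries; the paths, schedule, and age threshold are placeholders, not the values actually in use:
# weekly tarball of the PhEDEx area
0 3 * * 0 tar czf /data/backup/phedex-backup.tar.gz /data/phedex
# daily cleanup of old gLite job leftovers in /home/uscms01
0 4 * * * find /home/uscms01 -mindepth 1 -mtime +7 -delete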
- Get backup script to auto-copy to the hep-t3 webserver?
- OSG GridFTP-Hadoop installation (done apparently?)
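For the scratch-directory item above: the usual knob for where Condor creates per-job scratch directories (what the job sees as $_CONDOR_SCRATCH_DIR) is EXECUTE. A sketch for condor_config.local on the WNs, with the exact path up to us:
EXECUTE = /tmp
then run condor_reconfig on the node to pick up the change.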
Sandbox
Holding condor jobs
To hold all the jobs running on the condor batch system, as root from the HN:
condor_status -schedd
For all nodes listed as the scheduler for running or idle jobs (e.g. compute-x-y):
ssh compute-x-y
condor_hold -name compute-x-y -all
To resume jobs:
condor_status -schedd
For all nodes listed as the scheduler for held jobs (e.g. compute-x-y):
ssh compute-x-y
condor_release -name compute-x-y -all
There must be an easier way to do this, but I don't know what it is! cluster-fork "condor_hold -all" will only hold jobs submitted by the root user.
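A possible shortcut, assuming root on the HN is allowed to contact each schedd directly (if not, wrap the condor_hold/condor_release call in the ssh step as above):
for s in $(condor_status -schedd -format "%s\n" Name); do condor_hold -name $s -all; done
and to release everything again:
for s in $(condor_status -schedd -format "%s\n" Name); do condor_release -name $s -all; done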