June 2008 log
June was devoted to advanced Rocks configuration and grid installation and configuration.
June 25, 2008
MK--Fixed srm
- The srm clients that ship with OSG are out of date and not compatible with the latest BeStMan. Installed a newer dCache srm client and added it, along with the BeStMan srm client, to all users' PATH. Aliased srmcp to srmcp -access_latency=ONLINE.
June 24, 2008
MK--Continued RSV fixes, srm tests
- Changed OSG registration to state we provide the BeStMan-Xrootd service. Although we only have BeStMan installed, as far as I know, OSG only cares about setting the srm path this way.
- RSV srmcp test was failing. Thought this was due to the checksum, but after further investigation, the access_latency error appears to have returned. Even after setting it to ONLINE, I receive the same error. I have also been unable to get srm-copy to copy files from srm to file without using the srmcache keyword. Realized I'd never tested this scenario, so I suspect this is a BeStMan issue, not RSV. TODO
June 23, 2008
MK--Brought up GIP monitoring package, fixed some failed RSV tests
- Not writing up how to configure the GIP monitoring package in the How-To guide for now. Kept notes in case we decide to later. GIP monitoring is linked on the home page as well as on a tab at hepcms-0.umd.edu/wordpress
- Debugged a failed RSV test. Worked with two others, but no solution yet (TODO).
June 20, 2008
MK--Got RSV V2 running, numerous small tasks
- RSV folks had not propagated latest changes to the OSG 0.8.0 distribution area. They updated the configure script and after downloading & executing, it worked just fine. We are failing srmcp tests, unknown why (TODO).
- Found output for RSV monitor as well as other monitors and linked on this website. (TODO: Find out where gratia monitoring info shows up)
- Created cron job to clean up PhEDEx load tests. (TODO: move to /data/se/store from /data/store)
- Worked on a user password issue. Jeff had to use his old password and change it every time on the WNs. Had him log in on the HN. TODO: determine whether sync users or make 411 must be called to propagate new passwords from the HN to the WNs; if so, have a cron job do this, or figure out how to get 411 to do it automatically.
- We show up in VORS and BDII as having versions of CMSSW (yay!).
- Worked with users' PATH, although backed out changes later. Note: to edit PATH for both bash and c-shell users, edit /etc/profile and /etc/csh.login.
- Turned off Condor SUSPEND functionality, as CMSSW jobs cannot be suspended successfully. (TODO: give uscms01 user low priority, determine where Condor activity logs are being reported by gratia).
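A minimal sketch, in Python, of what the PhEDEx load-test cleanup cron job noted above might look like. The log does not record the actual script, so the function name, the age-based policy, and the dry_run option are assumptions for illustration only.

```python
import os
import time


def remove_old_files(root, max_age_days, now=None, dry_run=False):
    """Delete regular files under `root` whose mtime is older than
    `max_age_days` days. Returns the sorted list of affected paths.

    Hypothetical helper: the age threshold and directory are chosen by
    the caller; nothing here is taken from the real cron job.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    if not dry_run:
                        os.remove(path)
                    removed.append(path)
            except FileNotFoundError:
                # File vanished between listing and stat/remove; skip it.
                continue
    return sorted(removed)
```

A cron entry would call remove_old_files(root, 7) with root set to wherever the load-test files land under /data/store (the exact subdirectory is not recorded here). Running once with dry_run=True first shows what would be deleted.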
June 19, 2008
MK--Published CMSSW versions to OSG BDII, worked on RSV V2
- Modified /share/apps/osg-app/etc/grid3-locations.txt to include the slc4_ia32 environment as well as our two versions of CMSSW installed. BDII will update itself with this new information eventually. Not sure how to verify this (TODO).
- Restored RSV V1, verified that we can still get it running. Attempted to upgrade to RSV V2 again, but encountered (nearly) the same errors. Sent email off to get help. Investigation of RSV V1 indicates that the configure script cannot specify the srm path name as "srm/v2/server" for BeStMan instead of the default dCache value of "srm/managerv2". Therefore, RSV V2 is highly desirable. A short investigation of the BeStMan documents didn't show a way to get rid of the v2 part of the srm path (I do know how to change the server part). This may be desirable because we suspect OSG is not the only software that assumes the dCache path.
June 18, 2008
MK--Installed CMSSW, verified OSG registration, began PhEDEx SE configuration
- apt-get lock issue was because I wasn't sourcing the scram apt configuration script prior to calling apt-get update. Installed CMSSW with no problems.
- Our site now shows up in BDII and VORS. RSV service still down as V2 install went badly. May revert back to V1 soon.
- Now that we are registered with OSG, I configured PhEDEx with our CE and SE FQDN. Concerned that this configuration is for EDG sites only (TOCHECK).
June 16, 2008
MK--Registered with OSG, prepared for gLite tarball installation, worked with CMSSW issue
- Attended OSG meeting and modified a few entries to proceed with registration. Attempted RSV V2 install, encountered an error.
- Decided that the configuration parameters are entirely unknown and undocumented for a gLite UI installation done the 'proper' yum way. Config parameters are documented for the gLite UI tarball installation, and FNAL does the tarball installation as well. Prepped crab-node-0-0 by removing the yum install of gLite UI and re-shot the node. Node is now ready for tests of gLite UI installation via tarball.
- Suspect apt-get lock issue is because a user holds an rpm or similar lock somewhere, need to figure out ps command to find the process and kill it, or reboot the cluster.
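One possible way to find which process holds a lock file open, sketched in Python rather than with ps. It scans the Linux /proc filesystem for open file descriptors that point at the file. The function name is hypothetical, the lock file path must be supplied by the caller (it is not recorded in this log), and seeing other users' processes generally requires running as root.

```python
import os


def find_lock_holders(lock_path, proc_root="/proc"):
    """Return sorted PIDs that have `lock_path` open (Linux only).

    Hypothetical helper: walks /proc/<pid>/fd and compares each
    descriptor's symlink target with the resolved lock path.
    """
    target = os.path.realpath(lock_path)
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.add(int(entry))
            except OSError:
                continue  # descriptor closed while scanning
    return sorted(holders)
```

The returned PIDs can then be inspected with ps before deciding whether to kill anything.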
June 13, 2008
MK--Began CMSSW installation
- Had not yet installed CMSSW since latest cluster re-install, so took some time to do so today. Bootstrap script executed OK. Unable to issue apt-get update command as apt-get is unable to obtain lock.
June 12, 2008
MB & MK--Ordered spare hard disks, continued OSG registration process, began WN security
- Ordered one spare disk for the HN, the big disk array, and the WNs. The 2-year warranty should cover disk failures, but having the disks on hand will dramatically improve our response time in the event of failure.
- Began site registration. Need to attend OSG meeting before they will complete.
- WNs do not actually inherit security settings from the HN automatically. Getting 411 to propagate files is difficult (according to Joe Kaiser), so will put into Kickstart (TODO).
June 11, 2008
MK--Reconfigured for OSG SE, began OSG registration process
- Now that we know how to use BeStMan to create files exactly where we want, without the ugly directory names, we can use an SE. Reconfigured OSG for an SE.
- Need to reconfigure PhEDEx for SE (TODO).
- Began OSG registration process, waiting on new member approval.
June 9, 2008
MK--Discovered BeStMan does allow access to filesystem without nasty directory names.
- Turning the srmcache keyword on allows the SFN to contain absolute path names, so we can place files exactly where we want them. With the srmcache keyword in the SFN, files go to the srm-managed directories, which are created with nasty directory names that manage information such as expiration date. The contents of the srmcache can only (easily) be brought in and out using srm. These directories will not be used for the vast majority of future transfers, but they are enabled. So, to place files exactly where we want:
srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/data/users/srm-drop/whateverfile.ext
Now the file whateverfile.ext is placed in /data/users/srm-drop
And to place files in the srm-managed cache:
srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/srmcache/~/whatever.ext
Now the file whatever.ext is placed in /data/se/<custodial,output,replica>/uscms01/<letters-and-numbers>
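The two URL forms above can be generated and checked with a small Python helper. This is only a sketch: the function names are made up for illustration, and the endpoint is copied from the examples above rather than from any BeStMan configuration.

```python
from urllib.parse import parse_qs, urlsplit

# Endpoint exactly as it appears in the examples above.
ENDPOINT = "srm://hepcms-0.umd.edu:8443/srm/v2/server"


def direct_surl(abs_path, endpoint=ENDPOINT):
    """URL whose SFN is an absolute filesystem path."""
    if not abs_path.startswith("/"):
        raise ValueError("SFN must be an absolute path")
    return f"{endpoint}?SFN={abs_path}"


def cache_surl(filename, endpoint=ENDPOINT):
    """URL that targets the srm-managed cache via /srmcache/~/."""
    if "/" in filename:
        raise ValueError("expected a bare file name")
    return f"{endpoint}?SFN=/srmcache/~/{filename}"


def sfn_of(surl):
    """Extract the SFN query value from an srm URL."""
    return parse_qs(urlsplit(surl).query)["SFN"][0]


if __name__ == "__main__":
    print(direct_surl("/data/users/srm-drop/whateverfile.ext"))
    print(cache_surl("whatever.ext"))
```

The helpers only build strings; the resulting URL is then passed to whichever srm client is performing the copy.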
June 8, 2008
MK, JT, PR, TT--Configured BeStMan and configured OSG & PhEDEx to operate with and without the SE
- Discovered the srm path which BeStMan creates to our server. Initial tests indicated the file structure created by srm was not workable with CMSSW. Therefore configured OSG & PhEDEx to work without an SE, but saved info on how to configure with an SE.
June 7, 2008
MK, JT, PR, TT--Configured OSG & PhEDEx
- Reconfigured OSG for new SE.
- Verified OSG settings (fixed some).
- Continued registering our site with PhEDEx (held up by lack of knowledge of how BeStMan controls our SE).
- Tested SE, was able to make a copy from a local file in hepcms-0 to the SE via srm-copy. BeStMan creates a subdirectory with a yucky name for every file; this is obviously highly undesirable. It also appears that the SFN is not desirable for PhEDEx; /srmcache/uscms01 would be better as /store.
June 6, 2008
MK, JT, PR, TT--Began install of OSG on HN and PhEDEx on compute-0-7
- Completed OSG install and initial configuration.
- Installed BeStMan. srmcp attempts failed (we think user error, don't know what the full srm path is, TODO)
- Registered site with PhEDEx and site database.
- Installed PhEDEx (not configured).
June 5, 2008
MK--Completed reinstall of the cluster, mounted /data on all WNs, installed gLite-UI on one worker node.
- Condor not verified (TODO) and CMSSW not installed (TODO), but WNs fully configured with emacs, xemacs, RPMforge, and network mounts of all disks.
- Configured new CRAB appliance and replaced compute-0-6 with crab-node-0-0. (still hepcms-7.umd.edu).
June 4, 2008
MK--Reinstalled cluster from scratch, twice.
- Same issue as before: WNs went into an install sequence which required keyboard input. Suspect the first time was because I attempted screen shoot-node, followed by logging out, which hung WN installation midway. Suspect the second time was because I attempted to install WNs before configuring the switch correctly. WN installation also hung midway and was not salvageable.
- Discovered WNs can be salvaged from this state by forcing default partitioning and removing all modified .xml files in /home/install/site-profiles/4.3/nodes and creating a new 'default' distribution. WNs still have to be restarted manually to get out of the interactive install, but at least the entire cluster doesn't have to be reinstalled.
June 3, 2008
MK--Discovered all WNs failed shoot-node, installed cluster from scratch. Installed emacs on WNs.
- After issuing shoot-node command to WNs, nodes went into install sequence which required keyboard input (??). Looked a lot like normal SL install sequence, rather than Rocks install. Suspect problem with initial WN install - accidentally exited insert-ethers command on the HN mid-WN install. Reinstalled entire cluster (again).
- Successfully installed emacs on all WNs.