New Cluster Assembly Journal
- Jul 30, 2004
- Looks like the down nodes are 00, 38, 51, and 55. 00 won't netboot. 38 installed, but appears to have hardware issues after installation, possibly a bad disk; it's out of the PBS pool for now. 51 and 55 should have installed, but didn't.
OK, 51's up and running; 55 just didn't get configured in the BIOS and is installing now.
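A down-node list like the one above can be pulled from the PBS node status report. The sketch below is not the script used here. It assumes the stanza layout that 'pbsnodes -a' normally prints (an unindented node name followed by indented "key = value" lines), and the function name down_nodes is made up for illustration.

```shell
#!/bin/bash
# Sketch: list nodes that PBS reports as down or offline.
# Reads `pbsnodes -a` style output on stdin: an unindented node name
# followed by indented "key = value" lines (assumed layout).
down_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }                              # unindented line: node name
        /^[ \t]+state = / && /down|offline/ { print node }   # state line for that node
    '
}

# Typical use on the head node (requires PBS):
#   pbsnodes -a | down_nodes
```

A node marked both down and offline is printed once, since both words sit on the same state line.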
- Jul 22, 2004
- NAG Fortran 95 compilers installed in /usr/local/stow/NAGWare-f95
- Jul 21, 2004
- 61-node LAM/MPI job over Myrinet works. Any node that shows as up in 'pbsnodes -a' can be used for jobs. LAM has been installed with native access to Myrinet and the PBS tm modules, so jobs can be started without the pbslam script used earlier on other clusters. Calling lamboot, mpirun, and friends as you would on stand-alone nodes is good enough. Sample submit script:
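#!/bin/bash
# Request 61 nodes for up to 40 minutes.
#PBS -l nodes=61
#PBS -l walltime=40:00
# Start the LAM runtime on the nodes PBS assigned to this job.
/usr/local/bin/lamboot $PBS_NODEFILE
# Run alltoall on every available CPU (C), with verbose RPI output.
/usr/local/bin/mpirun -ssi rpi_verbose level:1000 C alltoall
# Shut the LAM runtime down.
/usr/local/bin/lamhalt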
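Before booting LAM in a submit script like this one, it can help to confirm that PBS actually handed out the expected number of hosts. This is a minimal sketch, not part of the original script: count_hosts is a hypothetical helper, and it relies only on $PBS_NODEFILE holding one hostname per line, where a host may be repeated once per allocated processor.

```shell
#!/bin/bash
# Sketch: count the distinct hosts PBS assigned to this job.
# $1 is a node file with one hostname per line; repeated hosts count once.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a job script, before lamboot (the expected count of 61 is an assumption):
#   [ "$(count_hosts "$PBS_NODEFILE")" -eq 61 ] || echo "short allocation" >&2
```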
- Jul 17, 2004
- 61-node ssh test passed. Time to get MPI up and working. The newrhosts command has been deprecated in favor of passwordless ssh keys. Please see the updated cluster manual for instructions on setting up your keys.
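For orientation, passwordless key setup usually looks something like the sketch below. The cluster manual remains the authority. This assumes home directories are shared across the nodes, so that authorizing your own public key is enough for node-to-node logins. The helper name authorize_key is made up for illustration.

```shell
#!/bin/bash
# Sketch: append a public key to authorized_keys with the permissions sshd expects.
# Arguments: $1 = public key file, $2 = ssh directory (normally ~/.ssh)
authorize_key() {
    local pub="$1" dir="$2"
    mkdir -p "$dir" && chmod 700 "$dir" || return 1
    touch "$dir/authorized_keys" && chmod 600 "$dir/authorized_keys" || return 1
    # Skip the append if this exact key line is already present.
    grep -qxF -- "$(cat "$pub")" "$dir/authorized_keys" ||
        cat "$pub" >> "$dir/authorized_keys"
}

# Typical use (an empty passphrase is what makes the login passwordless):
#   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh
```

Running the helper twice with the same key leaves a single entry, so it is safe to rerun.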
- Jul 16, 2004
- Finished PBS/ssh true issue testing and tested a 21-node job. The first 21 nodes are open for public testing. The remaining two racks are installed and successful except for nodes 00, 51, and 55. Tomorrow, if there's time, we can test a 61-node job.
- Jul 14-15, 2004
- Finished kickstart testing and rolled Maui/PBS/Condor RPMs. Tim rolled the Myrinet driver RPM and tested it. A few nagging bugs remain around Condor and the post-install. Switched from the campus RHN proxy to Red Hat's due to scalability issues on installs of more than 5 nodes. First bulk install of 25 nodes in the first rack. All well, except bug00 won't DHCP itself.
- Jul 12-13, 2004
- Myrinet work and some simple node install tests
- Jul 7-9, 2004
- Grunt labor time: set the BIOS to serial and netboot, and gather MAC addresses for all the machines.
- Jul 6-7, 2004
- Power on worked; time to try to get the first nodes to kickstart. Initial switch and managed power configuration. Cluster is on a 100 Mbit uplink until the fiber arrives.
- Jul 1, 2004
- New cluster arrives and it's time to assemble it. Power on will be Monday, to let all components adjust to the temperature and any condensation evaporate.
[Photo: 74 boxes, 9 pallets]
[Photo: Finished! All 70 nodes racked.]
- June 19-24, 2004
- Clean out the old SP2 and get the space prep'd for the new arrival
[Photo: Student staffers remove dozens of old disks :)]