I'm Alex Kearney, a PhD student studying Computer Science at the University of Alberta. I focus on Artificial Intelligence and Epistemology.
I've been running some scripts on WestGrid recently, so I've collated the information I've gathered from reading through its guides.
Because I've been running parameter sweeps with multiple algorithms over a nice data-set, I've had to change the way I run experiments: I just don't have the capacity to run them in a sane amount of time on my own machine. To get around this, I've set up my code to run experiments on WestGrid, a research computing system.
In the process, I've written up a little guide for running experiments politely. This is mostly a reference for future-me.
Note: this isn't particularly exhaustive; it's just an introduction, so make sure you read WestGrid's guides.
Jasper meets my needs, so I'll use it for the overview. The technical specifications are:
> Jasper is an SGI Altix XE cluster with an aggregate 400 nodes, 4160 cores and 8320 GB of memory. 240 nodes have Xeon X5675 processors, 12 cores (2 x 6) and 24 GB of memory. Of these, 32 have additional memory for a total of 48 GB. 160 nodes, formerly part of the Checkers cluster, have Xeon L5420 processors, 8 cores (2 x 4) and 16 GB of memory.
| Resource               | Limit    |
| ---------------------- | -------- |
| Maximum Walltime       | 72 hours |
| Maximum Running Jobs   | 2880     |
| Maximum Jobs Submitted | 2880     |
| Maximum Jobs in Queue  | 5        |
Walltime is the longest a job is allowed to run: after 72 hours, the job will be killed. You can get around this by writing your script so that it performs the job in chunks, checkpointing its progress as it goes. This is advisable in any case: if you set your walltime too low and your job is aborted partway through, you can pick up where you left off without the hassle of re-running everything.
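As a rough sketch of the chunking idea, assuming the experiment is a Python script and inventing a `checkpoint.pkl` file and some step counts purely for illustration:

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # hypothetical checkpoint file
TOTAL_STEPS = 1000000          # total work the experiment must do

# Resume from the last checkpoint if one exists; otherwise start fresh.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
else:
    state = {"step": 0, "results": []}

while state["step"] < TOTAL_STEPS:
    # ... one unit of experiment work goes here ...
    state["step"] += 1

    # Save progress periodically so a killed job loses at most one interval.
    if state["step"] % 10000 == 0:
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)
```

If the scheduler kills the job at the walltime limit, resubmitting it resumes from the last saved state rather than from scratch.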
The maximum number of jobs you can have running or submitted at any one time is 2880: the number of blocked jobs waiting to be executed may not exceed 2880, and neither may the number of running jobs.
When a job is submitted via `qsub`, it is put through a scheduling system. The scheduler balances fairness with utilization in a number of ways; the actual algorithm is publicly available, but I've found it's easier to submit a large number of short jobs rather than a small number of long ones. Here's a template job script:
```sh
#!/bin/sh
# Run the job under /bin/sh
#PBS -S /bin/sh
# Join stderr into stdout
#PBS -j oe
# Mark the job as not re-runnable
#PBS -r n
# Write the log to logs/, tagged with the job's id
#PBS -o logs/filename.$PBS_JOBID.log
# Request one core on one node, 20 minutes of walltime, and 1 GB of memory
#PBS -l nodes=1:ppn=1,walltime=0:20:00,mem=1gb
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
python experiment.py --horizon=50
echo "Completed run with exit code $? at: `date`"
```
```bash
#!/bin/bash
# Generate and submit one PBS script per configuration.
# Note: the original loop variable was `s`, but the job body used an
# undefined $runseed; I assume they were meant to be the same variable.
for runseed in s1
do
    for a in a1 na1   # condition flags; presumably used in the real experiment paths
    do
        for alg in autotd td tdr totd
        do
            # Write the job script. \$ and \` keep PBS variables and command
            # substitutions unexpanded until the job actually runs.
            cat > $alg-$runseed.pbs <<EOF
#!/bin/bash
#PBS -S /bin/bash
#PBS -M kearney@ualberta.ca
#PBS -m bea
#PBS -l walltime=01:00:00

cd \$PBS_O_WORKDIR
echo "Current working directory is \`pwd\`"
module load application/python/2.7.3
time python ./pysrc/experiments/prosthetic-experiment.py 1000 $runseed ~/usage-td-experiments2/usage-td-experiments/results/rndmdp-experiments/state-100-ftype-binary/ $alg > $alg-$runseed.txt
EOF
            qsub $alg-$runseed.pbs
        done
    done
done
```
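The generated `.pbs` files stick around after submission, which turns out to be handy: if a single configuration fails, you can resubmit just that file rather than re-running the whole sweep. A few other commands I've found useful follow.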
List the available space on your account:

```sh
lfs quota -u kearney /lustre
```
Show all the jobs associated with your username:

```sh
showq -u kearney
```
Delete a job:

```sh
qdel jobid
```
To immediately kill all your current jobs:
```sh
qdel $(showq -u yourname | awk '{print $1}')
```

This takes the first column of showq's output (the job id) and hands each one to the scheduler for deletion.
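(Note that showq's header lines get swept up by this too; they aren't valid job ids, so qdel will simply complain about them and carry on.)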
For more on running jobs, look here.