SGE Arraytasks and Matlab
Recently, several CIRCE users have needed to take a set of input values and iterate over them in parallel in Matlab to save time. Most people would consider the Parallel Computing Toolbox, but that requires modifying code as well as a license for the toolbox itself. I have an alternative that lets users run their Matlab code in parallel, with minimal modification, as an arraytask in Sun Grid Engine (SGE). Other schedulers support array jobs too, so this can be adapted to whatever scheduler is in production.
To review: an arraytask is a type of HPC job that runs a piece of software multiple times simultaneously, each time with a different set of inputs. This is usually done for compute tasks that are embarrassingly parallel in nature. Here’s an example of a simple arraytask in SGE:
#!/bin/bash
#$ -N my_app_array_run
#$ -o output.$JOB_ID
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=00:45:00
#$ -t 1-48
./my_app inputfile.$SGE_TASK_ID
This creates 48 tasks, each of which runs ./my_app with its own argument, from inputfile.1 for the first task through inputfile.48 for the last. Now, how do we use this tool with Matlab?
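Before turning to Matlab, it can help to preview how the task ID turns into a command line without submitting anything. The sketch below is only a local dry run: task_command is a hypothetical helper, not part of SGE, and the loop sets SGE_TASK_ID by hand to imitate what the scheduler would do for a three-task array.

```shell
#!/bin/bash
# Hypothetical helper (not an SGE feature): print the command that a
# given task ID would run in the example job script.
task_command() {
    local task_id=$1
    echo "./my_app inputfile.${task_id}"
}

# Imitate a small array job locally, as if the script had "#$ -t 1-3".
# Inside a real array job, SGE sets SGE_TASK_ID for each task.
for SGE_TASK_ID in 1 2 3; do
    echo "task ${SGE_TASK_ID}: $(task_command "$SGE_TASK_ID")"
done
```

Running this prints one line per task, which is a cheap way to check the input file naming before the job goes to the queue.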
The answer is environment variables. First, use a job script like this one to export the appropriate value to the environment:
#!/bin/bash
#$ -cwd
#$ -l h_rt=1:00:00
#$ -j y
#$ -N matlab_arraytask
#$ -o output.$JOB_ID
# HOW MANY TASKS?
#$ -t 1-6
task=1
# SPECIFY A LIST OF INPUTS EQUAL TO NUMBER OF TASKS
for i in 1 2 3 4 5 6; do
if [[ "$task" -eq "$SGE_TASK_ID" ]]; then
INPUTVALUE=$i
fi
task=$((task + 1))
done
export INPUTVALUE
# Replace my_function with the name of your own Matlab function or script
matlab -nodisplay -r "my_function; exit"
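The selection loop in the script above can also be written as a lookup in a bash array, which may be easier to maintain when the inputs are not simply 1 through 6. This is a minimal sketch, not the original method: the values in INPUTS are made up, and input_for_task is a hypothetical helper name. It assumes task IDs start at 1, as they would with "-t 1-6".

```shell
#!/bin/bash
# Hypothetical input list; replace with your own values, one per task.
INPUTS=(0.1 0.5 1.0 2.5 5.0 10.0)

# Map a 1-based SGE task ID onto the 0-based bash array.
input_for_task() {
    local task_id=$1
    if (( task_id < 1 || task_id > ${#INPUTS[@]} )); then
        echo "task ID ${task_id} is outside 1-${#INPUTS[@]}" >&2
        return 1
    fi
    echo "${INPUTS[task_id - 1]}"
}

# SGE sets SGE_TASK_ID inside an array job; fall back to 1 so the
# script can also be tried out locally.
INPUTVALUE=$(input_for_task "${SGE_TASK_ID:-1}") || exit 1
export INPUTVALUE
echo "INPUTVALUE=${INPUTVALUE}"
```

With this layout, the number of tasks requested with -t should match the number of entries in INPUTS. The range check makes a mismatch fail loudly instead of silently exporting an empty value.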
Next, add this to your Matlab code to convert the environment variable into a Matlab variable:
inputvalue = str2double(getenv('INPUTVALUE'));
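Matlab's getenv can only see variables that the job script has actually exported, which is why the export line matters. The sketch below illustrates the difference using a child bash process as a stand-in for Matlab. The child_sees helper and the value 42 are purely illustrative.

```shell
#!/bin/bash
# Stand-in for Matlab: a child process that reads INPUTVALUE from its
# environment and prints a marker if the variable is not there.
child_sees() {
    bash -c 'echo "${INPUTVALUE:-<unset>}"'
}

unset INPUTVALUE          # start clean in case it was already exported
INPUTVALUE=42             # plain shell variable, not exported yet
before=$(child_sees)      # the child process cannot see it

export INPUTVALUE
after=$(child_sees)       # the child process now inherits it

echo "before export: ${before}; after export: ${after}"
```

If the Matlab side reads an empty string from getenv, a missing export in the job script is one of the first things to check.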
With this, and sufficient computing resources to run every task at once, you can cut the time needed to run all of your inputs to roughly the time needed to run just one.