Running parallel jobs with MPI

Parallel jobs come in different types. For those that run on a distributed environment such as a cluster, special daemons need to be started on the execution host(s) before the job itself can run. In SGE parlance these special requirements are met by a parallel environment (PE). Several PEs are configured on the cluster, and others can be configured on demand. The installed PEs can be listed with qconf -spl, and the details of a specific one shown with, e.g., qconf -sp mpich. PEs define the policies by which parallel jobs are distributed over the nodes. A "fill up" value for allocation_rule indicates that process slots are allocated greedily on a per-node basis: if a node has 8 slots in the indicated queue and they are all free, all 8 will be assigned to the submitted parallel job before any other node is used.
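For example (the output below is illustrative; the actual PE names, scripts, and slot counts depend on the cluster configuration):

```shell
# List the parallel environments configured on the cluster
shell$ qconf -spl
mpich
openmpi

# Show the details of one PE; note the allocation_rule line
shell$ qconf -sp mpich
pe_name            mpich
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /usr/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```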

In order to use Open MPI, it is mandatory to use the flag -binding linear:256.

When submitting a parallel job to an SGE system, one needs to specify the PE for that job and the number of parallel tasks/threads that the job will spawn. Open MPI detects automatically when it is running inside SGE, thanks to the integration with the system set up at installation time. Specifically, if you execute an mpirun command inside an SGE job, it will automatically use the SGE mechanisms to launch and kill processes. There is no need to specify which nodes to run on: Open MPI obtains this information directly from SGE. Namely, the SGE system allocates the slots and writes them to a hostfile that mpirun reads before starting. (The hostfile can be seen in one of the log files left by the system in the launching directory.)
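The format of this hostfile (pointed to by the $PE_HOSTFILE environment variable inside the job) is one line per allocated node: hostname, number of slots granted on that node, queue instance, and a processor range. An illustrative example, with made-up node and queue names:

```
node01 8 all.q@node01 UNDEFINED
node02 4 all.q@node02 UNDEFINED
```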

The following is an example for running 4 copies of a single-thread executable a.out:

# Allocate an SGE interactive job with 4 slots
# from a parallel environment (PE) named 'mpich'
shell$ qrsh -pe mpich 4 mpirun -np 4 a.out

Here is a more involved example: an R script that uses Rmpi and spawns slaves internally. Take a sample Rmpi script, save it as sample.R, and change the line mpi.spawn.Rslaves() into mpi.spawn.Rslaves(nslaves=5).
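As a minimal sketch of what sample.R could contain (the mpi.remote.exec line is just an illustrative payload; substitute your real workload), the file can be created from the shell like this:

```shell
# Write a minimal Rmpi test script to sample.R
cat > sample.R <<'EOF'
library("Rmpi")

# Spawn 5 slaves, matching the 6 slots (1 master + 5 slaves)
# requested with '-pe mpich 6' in the wrapper script below
mpi.spawn.Rslaves(nslaves=5)

# Have every slave report its rank, as a quick sanity check
mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))

# Shut the slaves down cleanly and exit
mpi.close.Rslaves()
mpi.quit()
EOF
```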

We then need a wrapper script to submit with qsub; call it, for instance, reportwrap.sh:

# Run using bash
#$ -S /bin/bash
# The name which shows up for this job in the grid engine reports
#$ -N reportwrap
#$ -l xeon5410
#$ -l medium
#$ -binding linear:256
# The number of processors required - 1 for master, plus whatever
# number for slaves.
# This example runs with a master, plus 5 slaves
#$ -pe mpich 6
# Run in the current directory
#$ -cwd
# Run the job. Now we want one copy of our script running, hence -np 1.
/opt/openmpi/mpirun -np 1 R CMD BATCH --slave sample.R

Finally, we can send to the queue system:
shell$ qsub reportwrap.sh

While running, the slaves produce some log files in the launching directory, named after the nodes where they are executing. Check that the slaves are spread among the nodes as expected; these files are deleted automatically after job completion. Further output is placed in files named reportwrap.#####, among which you can find a copy of the temporary hostfile generated by SGE. Check that its content corresponds to where the slaves are actually running. Finally, the flag --slave in the call to R tells the program to run quietly and send all output to sample.Rout.

To see which queue is currently the most convenient one to send a job to, you may use qstat -g c.
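An illustrative qstat -g c session (queue names and numbers are made up; your cluster will differ). The AVAIL column shows how many slots are currently free in each cluster queue:

```shell
shell$ qstat -g c
CLUSTER QUEUE      CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------
all.q                0.52     12      0     20     32      0      0
medium               0.10      2      0     14     16      0      0
```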

Other sources of information on the integration of Open MPI with SGE are the Open MPI documentation and the Grid Engine manual pages.
