Using SPIDER on IBM SP Clusters with Job Scheduling

This page illustrates SPIDER operations that help create and control multiple SPIDER jobs running in parallel on a loosely coupled cluster, where a job scheduling system assigns nodes to users and regulates processing time.

Example: Alignment of Single Particles

bjob

Schedules node use with LoadLeveler on IBM SP clusters. To run: llsubmit bjob

#@ job_name        = bjob                 
#@ output          = bjob.out 
#@ error           = bjob.err 
#@ job_type        = parallel 
#@ network.MPI     = css0,not_shared,us 
#@ node_usage      = not_shared 
#@ environment     = COPY_ALL 
#@ notification    = complete 
#@ class           = regular 

#@ tasks_per_node  = 1 
#@ node            = 25 

# 
#@ wall_clock_limit= 7:50:00 
#@ queue 

date 
#reserve one node for each SPIDER job
poe -nodes 25 -procs 25 -pgmmodel mpmd -cmdfile bjob.cmd 
date 

bjob.cmd

Starts the master task on one node and an idle task on each of the other nodes.


#- command --- Project/ Data  -- initial ---- results --- reg. 
#              extension       procedure   file number    setting   

./spider        pam/acn        @b_master        0         x11=11
./spider        pam/acn        @b_idle          1         x77=16
./spider        pam/acn        @b_idle          2         x77=17 
./spider        pam/acn        @b_idle          3         x77=18 
..
..
..
./spider        pam/acn        @b_idle         24         x77=40  
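
Because the poe call in bjob asks for 25 tasks and bjob.cmd supplies the command lines, it can help to check that the two agree before submitting. The following is a minimal sketch of such a check in Python, not part of SPIDER or POE. The helper names count_tasks and check_cmdfile are hypothetical, and the sketch assumes that blank lines and lines starting with # do not count as tasks, as the comment header in the listing above suggests.

```python
def count_tasks(cmdfile_text):
    """Count task lines: non-blank lines that are not '#' comments."""
    tasks = 0
    for line in cmdfile_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            tasks += 1
    return tasks


def check_cmdfile(cmdfile_text, procs):
    """Return True when the number of task lines equals the -procs value."""
    return count_tasks(cmdfile_text) == procs


# Shortened sample: the header comment, the master line and the first idle line.
sample = """\
#- command --- Project/ Data -- procedure -- results -- reg.
./spider  pam/acn  @b_master  0  x11=11
./spider  pam/acn  @b_idle    1  x77=16
"""
print(count_tasks(sample))        # number of task lines in the sample
print(check_cmdfile(sample, 25))  # False: the sample is far shorter than 25 tasks
```

For the real file, the text would be read from bjob.cmd and compared against the value given to -procs in bjob.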

b_master.pam

Master task. Started on one node only. Coordinates and synchronizes all tasks.

; ArDean Leith Nov 2000

; INPUT:
; x12 (Starting micrograph number)
; x13 (Ending micrograph number)
; myinput/reference_volume (3-D input file)
; myinput/win_part@***** (2-D projections)
; myinput/ngood{***mcg} (Selection doc files)

; OUTPUT:
; out/prj**** (Projections)
; select (Doc file)
; refangles (Doc file)
; out/apmq{***x77} (Doc files)

x12 = 16 ; starting micrograph number
x13 = 40 ; ending micrograph number

MD
TR OFF ; decrease output to results file
MD
VB OFF ; decrease output to results file
MD
SET MP ; use SMP on 2 processors per node
2

VM ; dir: out{..} NEEDED
mkdir out

x11=1
; activate slave task for each micrograph
DO LB1 x77=x12,x13
; create sync document files with register settings for each slave
@b_startslave[x11,x41,x42,x51,x52,x55,x66,x76,x77]
LB1

VM
echo "b_master waiting for all alignments"
MY FL ; flush results

; wait for alignments to finish
@b_wait[x11,x12,x13,x47,x66,x76]

; alignments finished, signal slaves to end
x11=99
DO LB3 x77=x12,x13
@b_startslave[x11,x41,x42,x51,x52,x55,x66,x76,x77]
LB3

EN

b_startslave.pam

SPIDER procedure called by b_master. It creates a doc. file for each group. The doc. file signals the b_idle tasks to start processing and also passes register settings to them.

[x11,x41,x42,x51,x52,x55,x66,x76,x77]
; ArDean Leith Nov 2000
; Creates doc files used to wake up and pass info to idle tasks

; INPUT
; reg: 41
; reg: 42
; reg: 51
; reg: 52
; reg: 55
; reg: 66
; reg: 76
; reg: 77

; remove any existing document file for settings & to sync files
VM ; remove old sync doc. file for this group
\rm -f jnkdoc{***x77}.$DATEXT
; create document file with register settings
SD 11,x11 ; contains type of slave flag
jnkdoctmp{***x77}
SD 41,x41
jnkdoctmp{***x77}
SD 42,x42
jnkdoctmp{***x77}
SD 51,x51
jnkdoctmp{***x77}
SD 52,x52
jnkdoctmp{***x77}
SD 55,x55
jnkdoctmp{***x77}
SD 66,x66
jnkdoctmp{***x77}
SD 76,x76
jnkdoctmp{***x77}

SD E
jnkdoctmp{***x77}
VM
mv jnkdoctmp{***x77}.$DATEXT jnkdoc{***x77}.$DATEXT
RE
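
The procedure above writes the settings to jnkdoctmp{***x77} and only then renames the file to jnkdoc{***x77}, so a waiting task never sees a half-written file under the final name. The same write-then-rename pattern can be sketched in Python as follows. This is an illustration under assumptions, not SPIDER code: publish_settings is a hypothetical helper, the "key value" line format is not the SPIDER document file format, and the default extension acn is taken from the data extension shown in bjob.cmd. On a local POSIX filesystem the rename is atomic; behaviour on a shared cluster filesystem is not addressed here.

```python
import os
import tempfile


def publish_settings(directory, group, settings, ext="acn"):
    """Write settings to a temporary file, then rename it to the signal name."""
    tmp_path = os.path.join(directory, f"jnkdoctmp{group:03d}.{ext}")
    final_path = os.path.join(directory, f"jnkdoc{group:03d}.{ext}")
    # Remove a stale signal file left over from an earlier run, if any.
    if os.path.exists(final_path):
        os.remove(final_path)
    # Hypothetical file format: one "register value" pair per line.
    with open(tmp_path, "w") as handle:
        for register, value in settings.items():
            handle.write(f"{register} {value}\n")
    # The complete file appears under its final name in a single step.
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as workdir:
    path = publish_settings(workdir, 16, {11: 1, 76: 0})
    print(os.path.basename(path))  # jnkdoc016.acn
```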

b_idle.pam

Started on each node except for the master node. This task waits for the existence of a start-up file, jnkdoc{***x77}, created by b_master. When the signal (file) arrives, this procedure calls the SPIDER procedure b12.pam, which carries out the alignment for this group. When the alignment is finished, this procedure creates a new doc. file, jnkdocparamout{***x77}, which signals b_master that it can re-awaken.

; ArDean Leith Nov 2000

; INPUT:
; reg: 77 (group, on command line)
; jnkdoc{ } (doc file created by b_master & b_startslave)

; OUTPUT:
; jnkdocparamout{***x77} (signal file contains x11 & x47)

x77 ; group must be on command line!!!!!
MD
TR OFF ; decrease output to results file
MD
VB OFF ; decrease output to results file
MD
SET MP ; use SMP on 2 processors per node
2

; Awakens on signal from b_master ----------------------------
; Runs following operations for each awakening (100000=infinite)
DO LB1 i=1,100000
IQ SYNC ; wait for wake-up signal (file: jnkdoc{***grp})
jnkdoc{***x77}
(10 36000)

; retrieve registers stored in doc file: jnkdoc{***x77}
UD IC,11,X11
jnkdoc{***x77}
IF (x11.GE.99) THEN
; signal to kill this slave task
EN
ENDIF

UD IC,41,X41
jnkdoc{***x77}
UD IC,42,X42
jnkdoc{***x77}
UD IC,51,X51
jnkdoc{***x77}
UD IC,52,X52
jnkdoc{***x77}
UD IC,55,X55
jnkdoc{***x77}
UD IC,66,X66
jnkdoc{***x77}
UD IC,76,X76
jnkdoc{***x77}

UD ICE
jnkdoc{***x77}

VM ; remove this sync. doc file
\rm -f jnkdoc{***x77}*

VM
date
VM
echo "starting step: {**x76} group: {**x77}"

X11
MY FL ; flush results file
IF (x11 .EQ. 1) THEN
@b12[x77] ; runs alignment for this group.
ENDIF

; Signal b_master to re-awaken now
; (b_master wakes when it sees jnkdocparamout{***x77})
SD 11,X11 ; set sync file output
jnkdocparamout{***x77}

SD E
jnkdocparamout{***x77}

VM
echo "ending iteration: {**x76} group: {**x77}"
LB1

EN
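
The control flow of b_idle (wait for a wake-up file, read the flag, stop on 99 or higher, otherwise consume the file, do the work and write a completion file) can be sketched in Python as follows. This is an illustration only: idle_loop and wait_for_file are hypothetical names, the wake-up file here holds a single integer instead of a SPIDER document file, and returning on a timeout is an assumption about what should happen when no signal arrives.

```python
import os
import tempfile
import time

STOP_FLAG = 99  # the listing ends the task when register 11 is >= 99


def wait_for_file(path, interval, max_wait):
    """Poll until path exists; return False if max_wait seconds pass first."""
    deadline = time.monotonic() + max_wait
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def idle_loop(directory, group, work, interval=0.01, max_wait=1.0):
    """Serve wake-up files for one group until a stop flag or a timeout."""
    start = os.path.join(directory, f"jnkdoc{group:03d}")
    done = os.path.join(directory, f"jnkdocparamout{group:03d}")
    handled = 0
    while wait_for_file(start, interval, max_wait):
        with open(start) as handle:
            flag = int(handle.read().strip())
        if flag >= STOP_FLAG:
            break                      # stop signal: end this task
        os.remove(start)               # consume the wake-up file
        if flag == 1:
            work(group)                # stands in for @b12[x77]
        with open(done, "w") as handle:
            handle.write(f"{flag}\n")  # completion file wakes the master
        handled += 1
    return handled


with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "jnkdoc016"), "w") as handle:
        handle.write("1\n")
    aligned = []
    rounds = idle_loop(workdir, 16, aligned.append, max_wait=0.05)
    print(rounds, aligned)  # 1 [16]
```

The demo runs in a single process: the wake-up file is written first, the loop handles it once, and the second wait times out after 0.05 seconds.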

b12.pam

Started on each processor by b_idle.pam. Aligns particles to reference projections.

[x77] ; ArDean Leith Nov 2000

; Aligns particles to reference projections.
; Multireference alignment of an image series. For a
; project with multiple defocus settings, run this procedure
; separately for particles from each individual micrograph.

; If the pixel size differs from 4.78, the expected size of the object
; and the first and last ring parameters should be changed

; INPUT:
; out/prj**** (2-D ref. images)
; select (Selection doc file for refs. from b11.pam)
; scratch/leith/win_part@***** (Windowed images)
; myinput/ngood{***grp} (Selection doc files for windowed images)

; OUTPUT:
; out/apmq{***x77} (Alignment doc files)

MD
TR OFF ; decrease output to results file
MD
VB OFF ; decrease output to results file
MD
SET MP ; use SMP on 2 processors per node
2

MY FL ; flush output

AP MQ ; Alignment - 3D, multi reference
out/prj**** ; Template for 2-D reference image names (input)
select ; Selection doc. file for reference imgs. (input)
(10,1) ; Accuracy of the search
(5,47) ; First and last ring
/scratch/leith/win_part@***** ; Windowed images (input)
myinput/ngood{***x77} ; Windowed images selection doc. file (input)
out/apmq{***x77} ; Angles output file (output)

MY FL ; Flush output

RE

b_wait.pam

b_master, running on the master node, calls this procedure after awakening the b_idle tasks to carry out the alignment. When an alignment is finished, b_idle creates a new doc. file: jnkdocparamout{***x77}. This procedure causes b_master to wait for the creation of these files from each of the b_idle tasks.

[x11,x12,x13,x47,x66,x76]
; ArDean Leith Nov 2000

; Used in b_master. Waits for slaves to finish.
; For step id=2, accumulates register 47 contents from
; sync doc file.

; INPUT:
; reg: 11 (step id)
; reg: 12 (starting group)
; reg: 13 (ending group)
; reg: 66 (number of groups)
; reg: 76 (step number)
; jnkdocparamout{***grp}*

; OUTPUT:
; reg: 47 (acummulated reg #47)

x12 ; echo reg 12
x13 ; echo reg 13
x47=0 ; initialize return value

; wait for all micrograph groups -------------
DO LB3 x76=x12,x13
X77=56-x76 ; count down since group 16 is so long
x77
MY FL ; flush results
IQ SYNC
jnkdocparamout{***x77}
(10 36000)
VM
date
VM
echo "synced step: {**x76} group: {**x77} "
;
IF (X11 .EQ. 2) THEN
; b_defloopa sets x47 in jnkdocparamout{***x77}
UD 47,x12
jnkdocparamout{***x77}
x47=x47+x12
UD E
jnkdocparamout{***x77}
ENDIF
DE
jnkdocparamout{***x77}
;
MY FL ; flush results
LB3 ; end wait loop over groups -------

RE
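
The wait loop above visits the groups in descending order: with x12=16 and x13=40, X77=56-x76 maps 16..40 onto 40..16, and 56 is simply 16+40. A minimal Python sketch of the same idea follows. It is an illustration, not SPIDER code: wait_order and wait_for_groups are hypothetical names, the polling interval and maximum wait stand in for the two numbers given to IQ SYNC (assumed here to be seconds between checks and a maximum wait), and raising TimeoutError is an assumption about how a timeout might be handled.

```python
import os
import tempfile
import time


def wait_order(first, last):
    """Groups in descending order, generalising X77 = 56 - x76 for 16..40."""
    return [first + last - step for step in range(first, last + 1)]


def wait_for_groups(directory, first, last, interval=0.01, max_wait=1.0):
    """Block until each group's completion file exists, then delete it."""
    finished = []
    for group in wait_order(first, last):
        path = os.path.join(directory, f"jnkdocparamout{group:03d}")
        deadline = time.monotonic() + max_wait
        while not os.path.exists(path):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"group {group} did not finish")
            time.sleep(interval)
        os.remove(path)  # mirrors the DE operation on the completion file
        finished.append(group)
    return finished


with tempfile.TemporaryDirectory() as workdir:
    for group in (16, 17, 18):
        open(os.path.join(workdir, f"jnkdocparamout{group:03d}"), "w").close()
    print(wait_for_groups(workdir, 16, 18))  # [18, 17, 16]
```

Waiting on the slowest group last, as the comment in the listing suggests for group 16, means the completion files of the faster groups are usually already present by the time the loop reaches them.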


Source: techs/parallel/parallel_ibm.html     Last update: 26 April 2001     ArDean Leith