Installing and Managing SPIDER's PubSub System for Distributed Processing

Introduction

With PubSub, SPIDER procedures can be run in parallel on a distributed cluster of computers or within a single multi-processor computer. The user places a SPIDER job in a shared queue. Each of the subscriber machines can take jobs from the queue, and each subscriber machine can specify when it will take jobs and how many jobs it can take at a time. If the machines vary greatly in processing power, it is best to partition the SPIDER jobs so that each takes a reasonable length of time (e.g. 20-100 minutes), which keeps the subscription process efficient.


Installing PubSub

Requirements for PubSub

  1. Systems must have Perl and standard POSIX utilities.
    (If Perl is not located in /usr/bin/perl you will have to place a link there or alter the first line, the '#!/usr/bin/perl' shebang, of each Perl script.)

  2. Systems must have disks cross-mounted so that they are accessible from all processors using the same path, e.g. /net/location/.

  3. Systems must be able to use ssh to run operations remotely on all computers in the cluster. (If your systems only support rsh you will have to alter the Perl scripts.)

  4. The file which will be used for the 'publisher queue' must have read/write permissions for all users who will run PubSub.

PubSub Software Installation

  1. Create environment variables PUBSUB_DIR, for the location of the PubSub installation directory, and PUBSUB_MASTER, for the name of the host where the PubSub master is run. These environment variables must be set in each user's startup file (i.e. for csh users, in their .cshrc file) e.g.
    setenv   PUBSUB_DIR   /usr8/spider/pubsub
    setenv   PUBSUB_MASTER   radha
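    Users of sh-family shells (bash, zsh) would instead set the equivalents in their startup file (e.g. .bashrc) with export, e.g.
    export   PUBSUB_DIR=/usr8/spider/pubsub
    export   PUBSUB_MASTER=radha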

  2. The following steps should be done while logged in on PUBSUB_MASTER as a member of the group that is planning on using PubSub, NOT as root.

  3. Create your PUBSUB_DIRECTORY and cd to it e.g.
    mkdir   $PUBSUB_DIR ;   cd $PUBSUB_DIR

  4. Copy the PubSub files, normally distributed in SPIDER_DIR/pubsub, to your PUBSUB_DIRECTORY e.g.
    cp   SPIDER_DIR/pubsub/*   $PUBSUB_DIR

  5. Edit pubsub.permit e.g.
    xedit pubsub.permit
    Set machine-specific permissions. The file currently contains: machine name, limit for the number of simultaneous jobs, permitted run days, permitted start-time, permitted end-time, queue check frequency (seconds), and comments. The machine names here determine where jobs can be run.
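    An uncommented entry using the field order above might look like this (the values are illustrative; compare the commented-out example under 'Removing a compute node' below):
    node105 1 7 00:00 23:59 60
    This would let node105 run one job at a time on the permitted days, between 00:00 and 23:59, checking the queue every 60 seconds.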

  6. Create an empty queue file e.g. touch pubsub.que
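    Since all PubSub users must be able to read and write this file (see Requirements above), you may also need to open up its permissions, e.g.
    chmod 666 pubsub.que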

  7. Tune NFS (if the master node responds slowly).
    If your master and compute nodes will be accessing lots of data from an NFS-mounted disk you may want to speed up the process by altering the /etc/fstab mount entry for the data disks to increase the read and write buffer sizes e.g.:
    tonga2:/usr6 /usr6 nfs rsize=8192,wsize=8192 0 0
    See: NFS tuning for discussion.

    You may also want to increase the number of NFS server threads on the master node and any other machines where the data is located using:
    /usr/sbin/rpc.nfsd nproc
    where nproc is the desired number of server threads. This should be placed in your init file so that it will be preserved on reboot. On Red Hat GNU/Linux this is set in: /etc/rc.d/init.d/nfs. Both changes will require root access to the machine. See your Unix manual pages for: fstab & nfsd
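    For example, to run 16 NFS server threads (16 is only an illustrative count; pick a value suited to your load):
    /usr/sbin/rpc.nfsd 16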


Managing PubSub

Starting PubSub

  1. Login to the PubSub master    e.g.
    ssh radha
  2. cd YOUR_PUBSUB_DIRECTORY    e.g.
    cd   $PUBSUB_DIR
  3. Run startsub.perl to start a subscription process on this master node. If this process dies you will have to restart it in the same way.    e.g.
    startsub.perl

Removing a compute node from PubSub Use

  1. Login to the PubSub master    e.g.
    ssh radha
  2. cd YOUR_PUBSUB_DIRECTORY    e.g.
    cd   $PUBSUB_DIR
  3. Edit: pubsub.permit and comment out the node(s) by adding a '#' before the machine name    e.g.
    #node105 1 7 00:00 23:59 60

Removing a Rogue Job from the PubSub Queue

  1. Login to the PubSub master    e.g.
    ssh radha
  2. cd YOUR_PUBSUB_DIRECTORY    e.g.
    cd   $PUBSUB_DIR
  3. Run status to list the process id's of jobs in the queue.    e.g.
    status
  4. Run fixque.perl to delete the process from the queue. (Keep the negative sign if present.) e.g.
    fixque.perl process_id
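    For example, if status shows a subscribed job with the (hypothetical) id -1234:
    fixque.perl -1234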

List jobs on compute nodes

  1. Login to the PubSub master    e.g.
    ssh radha
  2. Run wherespi e.g.
    wherespi

Killing your master SPIDER job under PubSub

  1. Login to the PubSub master    e.g.
    ssh radha
  2. Run ps to get the process id.    e.g.
    ps -ef | grep username
  3. Kill the process.    e.g.
    kill -9 process_id

Killing your compute node SPIDER jobs under PubSub

  1. Login to the PubSub master    e.g.
    ssh radha
  2. Run killspi to delete all of your SPIDER jobs running on all compute nodes in this cluster.    e.g.
    killspi

Running SPIDER jobs using PubSub

    See use of PubSub.


PubSub components

Note: You do not need to understand this section in order to use PubSub. Perl code which may have to be altered because it may be site-specific is marked with %%%% in the source files.

startsub.perl
Start subscriber process.

subscribe.perl
Subscriber process. Watches the publisher queue for any new jobs. If a job appears, the subscriber looks for a suitable machine. When a machine is found the subscriber signals the publishing process where to run the job. This subscriber process checks the publisher queue at the specified frequency until it dies or is killed.
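A minimal Perl sketch of this polling pattern (illustrative only; the locking details, file handling, and dispatch step are assumptions, not the actual subscribe.perl code):

    #!/usr/bin/perl
    # Illustrative subscriber polling loop; NOT the real subscribe.perl
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    my $que   = 'pubsub.que';
    my $sleep = 60;                    # que check frequency, as set in pubsub.permit

    while (1) {
        open(my $fh, '<', $que) or die "cannot open $que: $!";
        flock($fh, LOCK_SH)     or die "cannot lock $que: $!";
        my @jobs = <$fh>;              # read the pending job lines
        close($fh);                    # closing the handle releases the lock
        foreach my $job (@jobs) {
            # ... find a machine permitted to run now (pubsub.permit),
            # ... then signal the publisher where to run this job
        }
        sleep $sleep;
    }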

publish.perl
Submits a job to the publisher queue. System flock is used internally to avoid update collisions.
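The flock call is the standard Perl idiom for serializing access to a shared file; a minimal sketch of the general technique (not the actual publish.perl code; the que entry format shown is hypothetical):

    # Illustrative flock-guarded append; NOT the real publish.perl
    use Fcntl qw(:flock :seek);

    my $entry = '9999  myjob.spi';     # hypothetical job number and job file
    open(my $fh, '>>', 'pubsub.que') or die "cannot open que: $!";
    flock($fh, LOCK_EX)              or die "cannot lock que: $!";
    seek($fh, 0, SEEK_END);            # file may have grown while we waited for the lock
    print $fh "$entry\n";
    close($fh);                        # closing the handle releases the lock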

delete.perl
Places job statistics in pubsub.log when a job is finished. System flock is used to avoid update collisions.

pubsub.permit
A single shared file containing machine-specific permissions. Currently contains: machine name, limit for the number of simultaneous jobs, permitted run days, permitted start-time, permitted end-time, queue check frequency (seconds), and comments.

pubsub.que
Publisher queue. This is a single shared file that is accessed by the subscriber process to obtain jobs. System flock is used to avoid update collisions. The job number becomes negative when a job is 'subscribed' to. Jobs are deleted from the queue when delete.perl runs.

killsub
Kills the PubSub subscriber process.

wherespi
Should tell you where SPIDER is currently running on all nodes. This is currently specific to our installation.

pubsub.log
PubSub log. This is a file that is created in the user's directory to log job progress. System flock is used to avoid update collisions. The run time for each job is recorded here, as well as the node name.


Source: spider/pubsub/pubsub_inst.html     Last page update: 10 Aug. 2010     ArDean Leith