Grid Computing


After completing this tutorial, you will be able to:

  • explain how to access each Grid Scheduler command from within R.
  • explain what Grid Scheduler is doing and describe its place in the overall system.
  • invoke Grid Scheduler commands qstat, qalter, and qdel from within RStudio and at the shell prompt.
  • interpret the output of the qstat command (qstat -f).
  • recognize when a job is pending because of a problem (Eqw job state).
  • recognize when a queue is in an alarm or error state and know whom to notify.
  • decrease the priority of your own jobs using the qalter command and explain what it means to increase or decrease priority.
  • kill a large number of jobs from within RStudio.

Introduction

NONMEM is more powerful when models, or sets of models, can be distributed across multiple processors. Grid Scheduler manages that distribution: it queues the jobs and assigns each one to an available processor. We interact with the grid of processors through Grid Scheduler commands, which can be executed at a shell prompt or passed from R using the function system().
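As a minimal sketch of passing a Grid Scheduler command from R, the helper below assembles a command line and hands it to system(). The function name grid_command() and its dry_run argument are hypothetical (they are not part of metrumrg); the dry run only returns the command string, so the example can be tried on a machine without a grid.

```r
# Hypothetical helper (not part of metrumrg): builds a Grid Scheduler command
# line and, unless dry_run is TRUE, passes it to the shell with system().
grid_command <- function(command, args = character(0), dry_run = TRUE) {
  # Only the four commands covered in this tutorial are accepted
  stopifnot(command %in% c("qsub", "qstat", "qalter", "qdel"))
  cmd <- paste(c(command, args), collapse = " ")
  if (dry_run) {
    return(cmd)
  }
  # intern = TRUE captures the printed output as a character vector
  system(cmd, intern = TRUE)
}

# Dry run: returns the command string without contacting the grid
grid_command("qstat", "-f")

# On a live Workflow, the same call with dry_run = FALSE would run qstat:
# grid_command("qstat", "-f", dry_run = FALSE)
```

Keeping the dry run as the default is a deliberate choice here: commands such as qdel are destructive, so it helps to inspect the string before executing it.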

qsub

qsub is used to submit a job to a queue.

qstat

qstat prints the status of the queues. It has a thin wrapper, qstat(), in metrumrg that can be called directly in R.

qstat()
queuename        qtype resv/used/tot. load_avg arch   states
---------------------------------------------------------------------
all.q@master                   BIP   0/1/8      0.04    linux-x64     
    562 0.55500 Run001     user0001     r     11/19/2014 11:59:15     1 

This call shows that user0001 has one job (Run001) running with JobID=562 and that one of the 8 slots on this machine is in use (0/1/8, read as reserved/used/total). (The number of available slots depends on the number of cores in the Metworx^TM^ workflow and ranges from 4 to 32 slots.) The important items in the qstat output are the JobID, the job name (here the run number), the user, and any errors that might be shown. (This qstat output shows no errors. An example of qstat output with an error is shown later in this chapter.)
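The items described above can also be pulled out of the qstat output programmatically. The sketch below is one possible approach, assuming the job lines have the eight whitespace-separated fields shown in the example (JobID, priority, name, user, state, date, time, slots); parse_qstat_jobs() is a hypothetical name, not a metrumrg function, and other qstat layouts (for example, array jobs) would need adjustments.

```r
# Hypothetical parser for the job lines of qstat -f output.
# Assumes each job line has at least the 8 fields shown in this tutorial.
parse_qstat_jobs <- function(lines) {
  # Job lines start with optional whitespace followed by a numeric JobID
  job_lines <- grep("^[[:space:]]*[0-9]+[[:space:]]", lines, value = TRUE)
  fields <- strsplit(trimws(job_lines), "[[:space:]]+")
  fields <- fields[lengths(fields) >= 8]
  pick <- function(i) vapply(fields, `[`, character(1), i)
  data.frame(
    job_id   = as.integer(pick(1)),
    priority = as.numeric(pick(2)),
    name     = pick(3),
    user     = pick(4),
    state    = pick(5),
    slots    = as.integer(pick(8)),
    stringsAsFactors = FALSE
  )
}

# The example output from this section, stored as a character vector
example_output <- c(
  "queuename        qtype resv/used/tot. load_avg arch   states",
  "---------------------------------------------------------------------",
  "all.q@master                   BIP   0/1/8      0.04    linux-x64",
  "    562 0.55500 Run001     user0001     r     11/19/2014 11:59:15     1"
)
jobs <- parse_qstat_jobs(example_output)
print(jobs)

# On a live Workflow the input could come from the shell instead:
# jobs <- parse_qstat_jobs(system("qstat -f", intern = TRUE))
```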

qdel

qdel deletes jobs from the queue. The call below deletes all jobs owned by username.

qdel -u username

qdel also accepts an individual JobID. Let's look again at the example qstat output from above.

queuename        qtype resv/used/tot. load_avg arch   states
---------------------------------------------------------------------
all.q@master                   BIP   0/1/8      0.04    linux-x64     
    562 0.55500 Run001     user0001     r     11/19/2014 11:59:15     1 

The JobID for Run001 is 562. Suppose we run the code below:

qdel 562

This call deletes only JobID 562, which is a convenient way to remove a single run while your other jobs keep running. NOTE: If you get the JobID wrong, no job is deleted unless the number you entered happens to belong to a job that is currently running.

You may find it necessary to delete a list of jobs, say bootstrap jobs that were submitted in error, while other jobs of yours keep running. One way is to loop over the JobIDs you want to delete in R. Let's assume the JobIDs run from 600 to 1600.

In R you can "shell out" to run commands that you would otherwise type at the terminal by using the system() function:

# Delete JobIDs 600 through 1600, one qdel call per job
for (i in 600:1600) {
  system(paste0("qdel ", i))
}

The block of code above deletes JobIDs 600--1600 from within RStudio. If the JobIDs are not consecutive or evenly spaced, substitute a vector of the JobIDs you want to delete for 600:1600.
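The vector-based variant mentioned above might look like the following sketch. The function delete_jobs(), its dry_run argument, and the JobIDs in the example are all made up for illustration; with dry_run = TRUE (the default) the function only returns the qdel commands it would run, which makes it easy to check the list before anything is deleted.

```r
# Hypothetical helper: builds one qdel command per JobID.
# Nothing is executed unless dry_run = FALSE.
delete_jobs <- function(job_ids, dry_run = TRUE) {
  job_ids <- unique(as.integer(job_ids))
  stopifnot(length(job_ids) > 0, !anyNA(job_ids), all(job_ids > 0))
  cmds <- paste("qdel", job_ids)
  if (!dry_run) {
    for (cmd in cmds) system(cmd)
  }
  invisible(cmds)
}

# Made-up, non-consecutive JobIDs of bootstrap runs submitted in error
bootstrap_ids <- c(600, 604, 611, 1599)

# Inspect the commands first
cmds <- delete_jobs(bootstrap_ids)
print(cmds)

# Then, on a live Workflow, actually delete the jobs:
# delete_jobs(bootstrap_ids, dry_run = FALSE)
```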

qalter

qalter is used to alter attributes of a running or pending job. For example, suppose a user has just submitted 500 bootstrap runs that fill the Workflow and then needs to submit some additional runs for a different project or model before the bootstrap runs have completed. The Grid Scheduler system is set up to allow fair grid sharing by using a first-in, first-out approach within a user, so the additional runs will not start until that user's bootstrap runs are complete. In most cases, you would prefer the one-off model runs to take priority over the bootstraps. Here qalter can be used to decrease the priority of the pending bootstrap runs so that the one-off model fills the next available slot. Let's take a look at some qstat output below.

queuename        qtype resv/used/tot. load_avg arch   states
-------------------------------------------------------------------
all.q@master         BIP   0/8/8      0.06    linux-x64     
    562 0.55500 Run001     user0001     r     11/19/2014 11:59:15     1        
    563 0.55500 Run002     user0001     r     11/19/2014 11:59:20     1        
    564 0.55500 Run003     user0001     r     11/19/2014 11:59:23     1        
    565 0.55500 Run004     user0001     r     11/19/2014 12:00:15     1        
    566 0.55500 Run005     user0001     r     11/19/2014 12:00:18     1        
    567 0.55500 Run006     user0001     r     11/19/2014 12:00:26     1             
    568 0.55500 Run007     user0001     r     11/19/2014 12:01:15     1        
    569 0.55500 Run008     user0001     r     11/19/2014 12:02:15     1        
####################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS 
####################################################   
    571 0.55500  Run010     user0001     qw   11/19/2014 12:01:15     1
    572 0.55500  Run011     user0001     qw   11/19/2014 12:02:15     1
    573 0.55500  Run012     user0001     qw   11/19/2014 12:03:15     1
    574 0.55500  Run013     user0001     qw   11/19/2014 12:01:15     1
    580 0.55500  Run035     user0001     qw   11/19/2014 12:03:15     1

JobID 580 and the 4 remaining bootstrap jobs are pending (not yet assigned to a machine). We want to decrease the priority of JobIDs 571--574 so that JobID 580 runs before the remaining pending bootstrap jobs. The priority is the decimal number in the second column of the qstat output above. Currently all jobs have the same priority. We will decrease the priority of JobIDs 571--574 using the qalter command within RStudio.

for (i in 571:574) {
  system(paste0("qalter -p -10 ", i))
}

The value following '-p' in the above code block sets the priority. You can supply a number between -1023 and 0. The default is 0, and ordinary users may only decrease the priority of their own jobs; increasing a priority is reserved for Grid Scheduler administrators. If we were to look at the qstat output again, it would look as follows.

queuename        qtype resv/used/tot. load_avg arch   states
-----------------------------------------------------------------------
all.q@master         BIP   0/8/8      0.06    linux-x64     
    562 0.55500 Run001     user0001     r     11/19/2014 11:59:15     1        
    563 0.55500 Run002     user0001     r     11/19/2014 11:59:20     1        
    564 0.55500 Run003     user0001     r     11/19/2014 11:59:23     1        
    565 0.55500 Run004     user0001     r     11/19/2014 12:00:15     1        
    566 0.55500 Run005     user0001     r     11/19/2014 12:00:18     1        
    567 0.55500 Run006     user0001     r     11/19/2014 12:00:26     1             
    568 0.55500 Run007     user0001     r     11/19/2014 12:01:15     1        
    569 0.55500 Run008     user0001     r     11/19/2014 12:02:15     1        
####################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS 
####################################################
    580 0.55500  Run035     user0001     qw    11/19/2014 12:03:15     1  
    571 0.55100  Run010     user0001     qw    11/19/2014 12:01:15     1        
    572 0.55100  Run011     user0001     qw    11/19/2014 12:02:15     1        
    573 0.55100  Run012     user0001     qw    11/19/2014 12:03:15     1     
    574 0.55100  Run013     user0001     qw    11/19/2014 12:01:15     1

The above output indicates that JobID 580 will be the next job executed when a core is available.
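The priority change used in this section can be wrapped in a small function that checks the requested value against the range given earlier (-1023 to 0) before calling qalter. This is only a sketch: lower_priority() and its dry_run argument are hypothetical names, and by default the function returns the commands rather than running them.

```r
# Hypothetical helper: builds one "qalter -p" command per JobID after
# checking that the priority is a whole number between -1023 and 0.
lower_priority <- function(job_ids, priority = -10, dry_run = TRUE) {
  stopifnot(
    is.numeric(priority), length(priority) == 1,
    priority == round(priority),
    priority >= -1023, priority <= 0
  )
  cmds <- paste("qalter -p", as.integer(priority), as.integer(job_ids))
  if (!dry_run) {
    for (cmd in cmds) system(cmd)
  }
  invisible(cmds)
}

# Same change as in the loop earlier in this section, shown as a dry run
cmds <- lower_priority(571:574, priority = -10)
print(cmds)

# On a live Workflow:
# lower_priority(571:574, priority = -10, dry_run = FALSE)
```

Rejecting positive values up front avoids a round trip to the scheduler for a request that an ordinary user is not allowed to make.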

Troubleshooting

There are times when a NONMEM job will pend in the queue and you will not know why. This tends to happen with parallel runs, rather than normal runs, when a job requests more cores than are available. The qstat output below demonstrates this situation.

queuename        qtype resv/used/tot. load_avg arch   states
-----------------------------------------------------------------
all.q@master        BIP   0/0/8      0.06    linux-x64           
####################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS 
####################################################
    590 0.60383  Run100     user0001     qw   11/19/2014 12:03:15     12  
    591 0.60383  Run101     user0001     Eqw  11/19/2014 12:03:15     1

The '12' at the end of the line for JobID 590 indicates that this job needs 12 compute cores. Since only 8 are available, it will pend indefinitely. This job should be killed using qdel and resubmitted with a correct number of cores. The 'Eqw' state of JobID 591 means Grid Scheduler failed when it tried to schedule the job. At this point the job will not run, and it should be killed using qdel. Check the NONR() call and the NONMEM control stream for obvious errors, correct them, and then rerun the job.
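The two checks described above (a pending job that requests more slots than exist, and a job in an error state) can be sketched as a simple filter. The function flag_problem_jobs() and the layout of the jobs data frame (columns job_id, name, state, slots) are assumptions made for this illustration; the data frame is typed in by hand here to mirror the qstat output above.

```r
# Hypothetical check: returns the jobs that need attention, with a reason.
# 'jobs' is assumed to have columns job_id, name, state, and slots.
flag_problem_jobs <- function(jobs, total_slots) {
  errored <- grepl("^E", jobs$state)                        # e.g. Eqw
  too_big <- jobs$state == "qw" & jobs$slots > total_slots  # can never start
  jobs$reason <- ifelse(
    errored, "error state: qdel, check NONR() call and control stream, rerun",
    ifelse(too_big, "requests more slots than exist: qdel and resubmit",
           NA_character_)
  )
  jobs[!is.na(jobs$reason), , drop = FALSE]
}

# The two pending jobs from the qstat output above
jobs <- data.frame(
  job_id = c(590L, 591L),
  name   = c("Run100", "Run101"),
  state  = c("qw", "Eqw"),
  slots  = c(12L, 1L),
  stringsAsFactors = FALSE
)

problems <- flag_problem_jobs(jobs, total_slots = 8)
print(problems)
```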

On rare occasions, the qstat output will indicate that a given node is in alarm state ('a'), error state ('E'), unreachable ('U'), suspended ('S'), or disabled ('d'). Some examples of this are shown below.

queuename        qtype resv/used/tot. load_avg arch   states
--------------------------------------------------------------------
all.q@master1         BIP   0/0/3      0.06    lx24-amd64        E
queuename        qtype resv/used/tot. load_avg arch   states
-------------------------------------------------------------------
all.q@master2         BIP   0/0/3      0.06    lx24-amd64        U

In the above output, all.q@master1 is in error state and all.q@master2 is unreachable. If nodes remain marked 'E', 'U', 'd', or 'S' after you have followed the troubleshooting guidance, contact the Metworx^TM^ help desk. You may need to delete the existing Workflow and start a new one.
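The state letters listed in this section can be decoded with a small lookup, sketched below. The function decode_queue_line() is a hypothetical name, and the mapping contains only the five letters named in this tutorial; it assumes the states column is the sixth whitespace-separated field of a queue line and is absent when the queue is healthy.

```r
# Hypothetical decoder for the 'states' column of a qstat -f queue line.
# The lookup contains only the state letters named in this tutorial.
decode_queue_line <- function(line) {
  fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  # A healthy queue line has five fields; a sixth field holds the states
  letters_found <- if (length(fields) >= 6) {
    strsplit(fields[6], "")[[1]]
  } else {
    character(0)
  }
  labels <- c(a = "alarm", E = "error", U = "unreachable",
              S = "suspended", d = "disabled")
  list(queue = fields[1], states = unname(labels[letters_found]))
}

bad  <- decode_queue_line("all.q@master1         BIP   0/0/3      0.06    lx24-amd64        E")
good <- decode_queue_line("all.q@master          BIP   0/1/8      0.04    linux-x64")
print(bad)
print(good)
```

A letter that is not in the lookup comes back as NA, which is a cue to consult the qstat man page.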

NOTE: Additional information on the Grid Scheduler commands can be found in the man page for each command. These can be accessed from the system command prompt by typing 'man ' followed by the command name, so for qstat we would type 'man qstat'.