This page demonstrates several ways to monitor jobs that are running in parallel, either on your master node or on compute nodes in the SGE grid.
Monitoring Locally on the Master Node
When running jobs in parallel on your master node, the easiest way to check that they are running (and using the expected number of cores) is the Linux utility htop. In your terminal (not your R console), type htop. This brings up the following visual:
The horizontal bar graph at the top tells you (among other things) about the cores on your machine and how heavily each one is being used. There is a bar for each core (each vCPU, actually), and each bar lights up green to show its percent utilization. The table below it shows all of your currently running processes. You might be amazed to see how many there are at any given time, and how little CPU most of them use. This is essentially the same information you see in top, another common Linux utility for monitoring processes.
This is htop after submitting eight NONMEM models, each using four cores. This is the first example under "Submitting Jobs to the Grid" in our "Grid Computing Intro". Notice that all cores are lit up green because we have submitted more than enough work to use all available CPU resources.
You can quit at any time with F10 and return to your terminal.
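If you want a quick, non-interactive check of the same idea, the sketch below prints the number of vCPUs next to the 1-minute load average. It is only a rough complement to htop, not a replacement: it assumes a Linux system with nproc and /proc/loadavg available, and the variable names are illustrative.

```shell
# Number of vCPUs visible to this session (htop draws one bar per vCPU).
cores=$(nproc)

# 1-minute load average: roughly the number of processes running or waiting
# to run (on Linux it also counts processes in uninterruptible I/O wait).
load=$(cut -d ' ' -f 1 /proc/loadavg)

echo "vCPUs: $cores, 1-minute load average: $load"
```

A load average close to the vCPU count suggests the cores are busy, while a value near zero suggests the jobs are not running on this machine.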
Monitoring on the Grid
If you're submitting jobs to the grid, you won't see them in htop, because htop only monitors the master node.
However, there are several other tools for monitoring work being done on the grid. The following examples monitor eight NONMEM models submitted to run on the grid, each using four cores. This is the second example under the "Submitting Jobs to the Grid" section in our Grid Computing Intro.
Starting with the 21.08 release, Metworx is configured to display usage information in a Grafana dashboard. You can access this by adding grafana to the end of your workflow URL, such as https://i-0972bb95cd8cb78ec.metworx.com/grafana. You can find more information on the Getting Started With Grafana page.
The top of the dashboard shows high-level stats about utilization across the entire grid.
Here we see that we have 32 available slots (or cores, see Grid terminology). This is because we have four worker nodes up, each of which has eight cores. Furthermore, you can see they are all in use since we submitted eight models and specified each to use four cores (8 * 4 = 32).
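The slot arithmetic above can be checked with a few lines of shell. This is only a sketch using the numbers from this example; the variable names are illustrative and not part of any Metworx or SGE tooling.

```shell
# Numbers taken from the example in this section.
worker_nodes=4      # worker nodes currently up
cores_per_node=8    # cores (slots) on each worker node
models=8            # NONMEM models submitted
cores_per_model=4   # cores requested for each model

total_slots=$((worker_nodes * cores_per_node))
requested_slots=$((models * cores_per_model))

echo "Slots on the grid: $total_slots"
echo "Slots requested:   $requested_slots"

if [ "$requested_slots" -le "$total_slots" ]; then
  echo "All jobs can run at the same time."
else
  echo "Some jobs will wait in the queue until more slots become available."
fi
```

With these values both totals come to 32, which matches the dashboard showing every slot in use.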
Below that, the dashboard shows CPU and memory usage for each worker node. Notice how each worker node has a small peak to the right-hand side, which shows they each have started working on a task (the job we submitted).
The view below is from ten minutes later, when all the jobs have finished. You can see that CPU utilization on each node went to 100% for roughly seven minutes while the models were running. You can also see the "used memory" line creep up as the models run, though it never comes close to using all the memory.
If you had looked at htop during this period, you would not have seen any activity, because all processing was done on the worker nodes and htop, as previously stated, only shows activity on the master node.
You can also use the qstat command (in the terminal, not the R console) to show the jobs currently queued or running on the grid. (The -f flag gives more informative output.)
Here is qstat -f right after the jobs were submitted, before any worker nodes have come up to process them:
And here is qstat -f showing all eight jobs being processed simultaneously on four 8-core worker nodes, as shown in the diagram at the end of the "Submitting Jobs to the Grid" section of our Grid Computing Intro.
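To watch jobs move from queued to running, as in the two snapshots above, you can rerun qstat -f at intervals. The sketch below is one possible way to do that from the terminal. The function name poll_grid_queue and its defaults are hypothetical, and it assumes qstat is on your PATH, as it should be on a workflow with the grid configured.

```shell
# Hypothetical helper: print `qstat -f` a fixed number of times,
# pausing between runs so you can see jobs change state.
# Usage: poll_grid_queue [iterations] [seconds_between_runs]
poll_grid_queue() {
  iterations="${1:-3}"
  pause="${2:-10}"

  # Bail out early if qstat is not installed on this machine.
  if ! command -v qstat >/dev/null 2>&1; then
    echo "qstat not found on PATH; run this in a terminal on your workflow" >&2
    return 127
  fi

  i=1
  while [ "$i" -le "$iterations" ]; do
    date        # timestamp each snapshot
    qstat -f    # same command as shown above
    if [ "$i" -lt "$iterations" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
}
```

For example, poll_grid_queue 6 10 would print six snapshots ten seconds apart. If the watch utility is installed, watch -n 10 qstat -f gives a similar continuously refreshing view.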