What is a workflow?


Fundamentally, each workflow you launch offers the capabilities of an entire high-performance computing (HPC) cluster. In many companies such clusters are shared by many people, but each Metworx user gets their own "private" instance.

Metworx workflows are labelled using a naming convention called "calendar versioning": <two digit year>.<month> of the initial release. So a workflow labelled 20.12 was originally built in December 2020, and one labelled 20.06 was built in June 2020. This gives an approximation of what software to expect on a workflow based on its age. Workflows generally include the latest version(s) of all software at release and, when possible, additional legacy versions for backwards compatibility.
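As a small illustration of the naming convention, the sketch below turns a workflow label into its release month. The function name `parse_workflow_version` is hypothetical and not part of Metworx, and the sketch assumes all two-digit years fall in the 2000s.

```python
from datetime import date


def parse_workflow_version(label: str) -> date:
    """Return the first day of the release month encoded in a label such as '20.12'."""
    year_part, month_part = label.split(".")
    year = 2000 + int(year_part)  # assumption: two-digit years mean 20xx
    month = int(month_part)
    # date() raises ValueError if the month is outside 1-12
    return date(year, month, 1)


print(parse_workflow_version("20.12").isoformat())  # 2020-12-01
print(parse_workflow_version("20.06").isoformat())  # 2020-06-01
```

Because the labels map to dates, comparing two parsed labels tells you which workflow is older and therefore likely to carry older software versions.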

Each workflow can be broken out into three key elements:

  • master/head node
  • compute node(s)
  • the /data disk, where data can be shared and will persist even when a workflow is shut down

Diagrammatically, this looks approximately like:

[Figure: workflow setup]

When you launch a workflow, the only part of the infrastructure that must run long-term is the master node, which remains on until you delete the workflow.

The compute nodes autoscale whenever you submit jobs to the grid. Metworx uses Grid Engine (SGE) to manage jobs submitted to the compute nodes; this is discussed in more detail in Working in Metworx. Autoscaling gives you access to the compute resources you need, but only when you need them, which allows for significant cost savings.

When compute nodes scale up in response to jobs submitted to the grid, the number of nodes depends on the number of jobs you submit and on the number of available slots on each compute node. A slot corresponds to a vCPU (virtual compute core). By default a job uses one slot (and thus one core), but this can be tuned if you have a job that needs multiple cores for parallel processing.
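To make the relationship between jobs, slots, and nodes concrete, here is a rough back-of-the-envelope sketch. It is not the actual Metworx autoscaling logic: the function name and the default of 8 slots per node are assumptions, it assumes a single job cannot span several nodes, and it ignores any cap on the number of nodes.

```python
import math


def estimate_compute_nodes(n_jobs: int, slots_per_job: int = 1, slots_per_node: int = 8) -> int:
    """Estimate how many compute nodes are needed to run all jobs at once."""
    if slots_per_job > slots_per_node:
        # assumption: a job must fit on one node
        raise ValueError("a job cannot use more slots than one node provides")
    jobs_per_node = slots_per_node // slots_per_job  # whole jobs that fit on one node
    return math.ceil(n_jobs / jobs_per_node)


# 100 single-slot jobs on 8-vCPU nodes: 8 jobs per node
print(estimate_compute_nodes(100))  # 13
# 10 jobs using 4 slots each on 8-vCPU nodes: 2 jobs per node
print(estimate_compute_nodes(10, slots_per_job=4))  # 5
```

The second call shows why tuning slots matters: giving each job more cores reduces how many jobs fit on a node, so the same number of jobs needs more nodes.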

Where is software running?

On the head node, RStudio Server and Guacamole are the two most common ways to log in. These correspond to the icons on the dashboard:

[Figure: workflow dashboard]

As outlined in Working in Metworx, RStudio Server is a fantastic entry point for many programmatic or command-line activities, even if you are not an R user.

The (Guacamole) Desktop provides a remote desktop without needing any software on your computer beyond a web browser. This is where software such as MATLAB, Monolix, Firefox, and a graphical desktop interface can be accessed.

Because all of this software runs on the head node, its performance is determined by the compute and RAM you select for the master node.

On the other hand, command-line jobs submitted to the grid, such as NONMEM jobs or R scripts run in batch, run on the compute nodes instead, which keeps your master node from being overwhelmed when a lot of work is being performed. Recommendations on how to select different sizes are available at Right Sizing Workflows.
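As a hedged sketch of what a batch submission can look like, the helper below assembles a Grid Engine `qsub` command for an R script. The helper name, the script name `fit_model.R`, and the parallel environment name `smp` are assumptions for illustration; parallel environment names are site-specific (on a Grid Engine system `qconf -spl` lists them), and the Metworx documentation may recommend other submission tools.

```python
import shlex


def build_qsub_command(script: str, job_name: str, slots: int = 1,
                       parallel_env: str = "smp") -> list[str]:
    """Build a qsub command that runs an R script as a batch job."""
    # -N names the job, -cwd runs it in the current directory,
    # -b y tells qsub the command is an executable rather than a job script
    cmd = ["qsub", "-N", job_name, "-cwd", "-b", "y"]
    if slots > 1:
        # request several slots via a parallel environment (name is an assumption)
        cmd += ["-pe", parallel_env, str(slots)]
    cmd += ["Rscript", script]
    return cmd


print(shlex.join(build_qsub_command("fit_model.R", "fit_model")))
# qsub -N fit_model -cwd -b y Rscript fit_model.R
```

The returned list could be passed to `subprocess.run` from the head node; the job itself would then execute on a compute node rather than on the master node.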

The remaining documents in the Getting Started section will get you better acquainted with the specific capabilities and functionality. Have fun!