FAQ


These are questions that have arisen during Metworx training or support and that are relevant to the general Metworx system.

Questions

Question: can only one disk be associated with one workflow?

Correct. The disk is local to that workflow; it is an attached drive. So, you cannot mount the same disk to multiple workflows at once.

One workaround (DANGER): you can make a new disk from a snapshot. If you truly need to mount essentially the same disk in multiple workflows at once, you could follow the steps below. NOTE: we do not recommend this workaround; it should only be used if it is determined to be necessary for a project.

  • Shut down the workflow
  • Figure out the ID of the disk in question
  • Create a new disk from a snapshot of that disk
  • After creating the new disk from the snapshot, when you go to create a new workflow, you will see the disk you just created if you choose to create a workflow using an existing disk.
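
For reference, from an IT/admin perspective the same workaround maps onto plain AWS operations. Below is a minimal, hedged sketch using boto3; the volume ID and availability zone are hypothetical placeholders, and regular users should do this through the Metworx dashboard rather than the AWS API.

```python
# Hedged sketch (boto3): snapshot an existing workflow disk and create a new disk from it.
# The volume ID and availability zone are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

source_volume_id = "vol-0123456789abcdef0"  # the disk in question's ID (placeholder)

# Snapshot the existing disk...
snapshot = ec2.create_snapshot(
    VolumeId=source_volume_id,
    Description="Copy of workflow disk for use in a second workflow",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# ...then create a new disk (EBS volume) from that snapshot.
new_volume = ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone="us-east-1a",  # placeholder; must match where the workflow runs
)
print("New disk:", new_volume["VolumeId"])
```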

Question: what is the ideal disk space?

We recommend a default of 100 GB of disk space. Disk space is inexpensive (pennies per gigabyte per month). With something like NONMEM, a single run does not take up much space, but if you run a bootstrap with 1000 models and each model takes up 10 MB of space, then all of a sudden your bootstrap has taken up 10 GB.
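
As a quick back-of-the-envelope check of that arithmetic (a sketch using the figures above):

```python
# Rough disk estimate for a bootstrap: number of models x approximate size per model.
n_models = 1000
mb_per_model = 10  # approximate output per model, from the example above

total_gb = n_models * mb_per_model / 1000
print(f"Estimated bootstrap footprint: ~{total_gb:.0f} GB")  # ~10 GB
```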

In Metworx, on the dashboard where you go to create new workflows (or spin up existing workflows), you will also see the disk utilization for each workflow.

Another way to check resource utilization: within your workflow, go to RStudio and change the workflow URL to end in /admin (removing everything following /rstudio in the URL). You will then get a dashboard with other information about your workflow as well, such as memory usage and ongoing CPU usage.

  • Go to RStudio in an active workflow, e.g. https://i-39asdf9.metworx.com/rstudio/s/asdf
  • Change the workflow URL to end in /rstudio/admin, e.g. https://i-39asdf9.metworx.com/rstudio/admin
  • View the RStudio dashboard with workflow information

rstudio metrics

From an IT/admin perspective, this type of information is also available via CloudWatch, where you can get much more sophisticated information than what is available to general users reviewing their own workflows.
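
As an illustration, an admin could pull CPU utilization for a workflow node out of CloudWatch with boto3. This is a hedged sketch; the instance ID is a placeholder, and the metrics available to you will depend on how your organization's monitoring is configured.

```python
# Hedged sketch (boto3): query CloudWatch for a workflow node's recent CPU utilization.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-39asdf9"}],  # placeholder instance ID
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")
```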

Question: what happens if 2 worker nodes are writing to the same file? What happens if there is some conflict?

This will act like any network file share. In particular, two compute nodes can absolutely write to the same file at the same time, and if the writes are append-only, appending is something the network file share protocol (NFS) handles pretty well.

However, just as on your master node, if you have 2 analyses running at the same time trying to save out to the same file, one will overwrite the other.

From a recommendation perspective: if you are running parallelized analyses where each process writes out to a file, use an additional file prefix or suffix to distinguish the files, and then aggregate them into a single result file afterwards.

For example, if you have 4 processes running, rather than writing everything out to something like results.csv, you can write out to resultsp1, resultsp2, etc., and then after they are done, aggregate and save out a final results file.
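
A minimal sketch of that pattern in Python (the worker function and file names are illustrative, not part of Metworx):

```python
# Hedged sketch: each parallel process writes its own resultsp<N>.csv,
# then a final step aggregates everything into results.csv.
import glob

import pandas as pd


def run_analysis_chunk(process_id: int) -> pd.DataFrame:
    # Placeholder for whatever each process actually computes.
    return pd.DataFrame({"process": [process_id], "estimate": [0.0]})


# Each process writes to its own file instead of a shared results.csv...
for process_id in range(1, 5):
    run_analysis_chunk(process_id).to_csv(f"resultsp{process_id}.csv", index=False)

# ...and once all processes are done, aggregate into a single results file.
combined = pd.concat(
    (pd.read_csv(path) for path in sorted(glob.glob("resultsp*.csv"))),
    ignore_index=True,
)
combined.to_csv("results.csv", index=False)
```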

Question: what is the limit to the number of concurrent file writes that could be happening on /data? For example, what if a user has 10,000 compute nodes that may be writing independent files but are all trying to write to /data simultaneously.

We believe the limit was bumped up to 99,000 or 100,000. It should be high enough that users can stream out a significant amount of data.

Example: for those of you who use NONMEM, you know it produces a large number of temp files and writes a lot to disk. There have been times where MRG's internal science group has spun up a 9600-core workflow that was running 9600 models at a time, with each model writing out 10-20 files at a time, with no problem on a 2-core master node.

Question: understanding impact of different disk and file sizes.

The larger the disk, the more throughput you get. So, comparing a 100 GB disk to a 1 TB disk, the 1 TB disk is not just 10x bigger; it is also faster.

When it comes to file sizes, consider the disk needing to read 100 MB from 1 file versus 10 MB each from 10 files. At the operating system level, reading from multiple smaller files is slower than reading from one large file, because you need to hit non-contiguous locations and stream them in separately, rather than leveraging the efficiencies the operating system can apply to a single large read.
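
If you want to see this for yourself, here is an illustrative sketch (not a rigorous benchmark: operating system caching, and the fact that the files are freshly written, can mask the effect at sizes this small):

```python
# Illustrative sketch: read ~100 MB as one file vs. as ten 10 MB files and compare timings.
import os
import time

one_mb = b"x" * (1024 * 1024)

with open("big.bin", "wb") as f:          # one 100 MB file
    f.write(one_mb * 100)
for i in range(10):                       # ten 10 MB files
    with open(f"small_{i}.bin", "wb") as f:
        f.write(one_mb * 10)

start = time.perf_counter()
with open("big.bin", "rb") as f:
    f.read()
print("one large file :", time.perf_counter() - start)

start = time.perf_counter()
for i in range(10):
    with open(f"small_{i}.bin", "rb") as f:
        f.read()
print("ten small files:", time.perf_counter() - start)

# Clean up the scratch files.
os.remove("big.bin")
for i in range(10):
    os.remove(f"small_{i}.bin")
```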

Question: after we initially create a disk, can we increase the size? And will all data get carried over after the disk size is changed?

Yes, absolutely. When you have a previously created disk and you go to launch a new workflow, you can select that disk and change the disk size. Everything that was on the disk will be available in your new workflow. Below are the general steps to do so:

  • Select the previously created disk when spinning up a new workflow

choose disk

  • See the original disk size

previous disk size

  • Enter desired disk size for new workflow

resize-disk
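
Once the new workflow is up, a quick hedged way to confirm the new capacity from within the workflow (assuming the disk is mounted at /data, as described elsewhere in this FAQ):

```python
# Check the resized disk's capacity from inside the new workflow.
import shutil

total, used, free = shutil.disk_usage("/data")
gib = 1024 ** 3
print(f"/data: {total / gib:.0f} GiB total, {used / gib:.0f} GiB used, {free / gib:.0f} GiB free")
```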

Question: how to delete disks that are no longer being used?

Steps for Metworx users to delete disks are as follows (admins can go in and delete any disks within the organization):

  • On the Metworx dashboard, click on your user profile, then click "My Disks"

my disks

  • Then, you will see all of your available disks. Any disk that is not currently mounted can be deleted.

available disks

Question: if I have a file in /tmp on a compute node and I stop my workflow, will that file be erased?

Yes, that will be erased. If you were to close your workflow, then open a new one, that file would no longer exist in /tmp.

/tmp and everything else on a compute node lives on one disk: the local disk of that compute node. The compute node also has a network file mount to /data from the master node. So, essentially, the master node is acting as a file share for all of the compute nodes via /data.

Anything you put in /data on your master node will become instantly available on your compute nodes as well via the /data file mount.
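
The practical upshot, as a small sketch (paths are illustrative): write anything you want to keep to /data, and treat /tmp as node-local scratch space.

```python
# Illustrative sketch: /tmp is node-local scratch; /data is the shared, persistent mount.
from pathlib import Path

scratch = Path("/tmp/intermediate.bin")      # local to this compute node; gone when the node stops
shared = Path("/data/project/results.txt")   # visible to the master and all compute nodes (illustrative path)

shared.parent.mkdir(parents=True, exist_ok=True)
scratch.write_bytes(b"throwaway intermediate data")
shared.write_text("keep this: it lives on the workflow disk via /data\n")
```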

Question: I have a 1 GB file in /data on my master node. If I then have 2 compute nodes, will the file be split evenly across the 2 compute nodes (so 500 MB on each)?

No. What will happen is that the file will be fully available on both compute nodes at the same time (because /data is a network file share).

So, if you have a script that needs to read a file from /data on a compute node, that script will be able to access the entirety of that data file.

If you need or want to do distributed processing across multiple compute nodes (e.g. process the first 500 MB on one compute node and the second 500 MB on another), then that is an exercise in writing your scripts so that they account for this distribution. Some ways this could happen:

On the first compute node, the script reads in the entire file from /data then discards the second half of it.

On the second compute node, the script reads in the entire file from /data then discards the first half of it.
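
A minimal sketch of that pattern (the input path is illustrative, and the node index is assumed to come in as a script argument; in practice it might come from your grid scheduler or job submission script):

```python
# Hedged sketch: every compute node can see the full file via /data,
# so each node reads it and keeps only the half it is responsible for.
import sys

import pandas as pd

node_index = int(sys.argv[1])                    # 0 for the first compute node, 1 for the second
rows = pd.read_csv("/data/project/input.csv")    # the entire file is visible on every node

half = len(rows) // 2
chunk = rows.iloc[:half] if node_index == 0 else rows.iloc[half:]

# ...process `chunk`, then write a node-specific output (see the results file tip above).
chunk.describe().to_csv(f"/data/project/resultsp{node_index}.csv")
```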