dsub
Add support to attach read-only disk
I'm trying to run multiple pipelines that all read the same data (~20TB). Copying this to all containers seems unreasonable, and the best way I could think of is to put it on a shared read-only PD.
Is there a way to attach a read-only data disk?
Hi @hnawar.
There is presently no support for attaching a disk read-only, although that is a very reasonable request and something that could now be done with the google-v2 provider (this was not possible with the google provider).
For now I would suggest experimenting with putting the resources into a GCS bucket and mounting the bucket with gcsfuse. The --mount parameter was just added with the most recent release, 0.2.1.
How to use it is described here:
https://github.com/DataBiosphere/dsub/blob/c968683f309577ca86fdd0c05fdf618a938c6088/README.md#mounting-buckets
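As a rough sketch (the project, bucket, and image names below are hypothetical, not from this thread), a job using the new flag would look something like this:
# Sketch only: RESOURCES is an arbitrary name; the bucket is mounted via
# gcsfuse and its mount path is exposed to the job through the ${RESOURCES}
# environment variable.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --image ubuntu:16.04 \
  --mount RESOURCES=gs://my-resource-bucket \
  --command 'ls "${RESOURCES}"' \
  --wait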
We tried GCSFuse, but the performance was underwhelming. It would be interesting to add the option to specify the name of a read-only disk and the path to mount it.
Thanks @hnawar. Looking deeper into the Pipelines v2 disk support, I'm not sure that this is in fact supported.
https://cloud.google.com/genomics/reference/rest/Shared.Types/Disk
https://cloud.google.com/genomics/reference/rest/Shared.Types/Action#Mount
The Mount type implies the ability to mount disks read-only into an action, but the Disk resource does not appear to support a way to attach an existing PD to the VM first.
Will verify this with the Cloud Health team.
I checked with the Cloud Health team and the recommended approach here is to create a Compute Engine Image and create the disk from that image.
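For illustration (the disk, image, and project names here are hypothetical), building the Image from a disk that already holds the shared data might look like:
# Sketch only: create a Compute Engine Image from a pre-populated disk.
gcloud compute images create my-resources-image \
  --source-disk my-resources-disk \
  --source-disk-zone us-central1-a \
  --project my-project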
The next step will be to wire through a new --mount option. I think it would look something like:
--mount RESOURCES=https://www.googleapis.com/compute/v1/<image-path>
As an example, using one of the public Ubuntu images:
--mount RESOURCES=https://www.googleapis.com/compute/v1/projects/eip-images/global/images/ubuntu-1404-lts-drawfork-v20181102
We would key off of https://www.googleapis.com/compute to detect the request to mount a GCE image, in the same way that we key off of gs:// to detect mounting a GCS bucket. Implicit here is that we would request creation of a new disk, which would be mounted readOnly into the user-action container.
Experimental support for mounting a PD built from a Compute Engine Image has been added in release 0.2.4, specifically with change https://github.com/DataBiosphere/dsub/pull/139/commits/0c4a93a59dc5e00100e1e4edae761ee7e761bddd.
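As a rough sketch of the usage (project, bucket, and image names are hypothetical; the trailing number is the size in GB of the disk to create from the image):
# Sketch only: dsub creates a new PD from the image, attaches it to the VM,
# and mounts it readOnly into the user-action container at ${RESOURCES}.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --image ubuntu:16.04 \
  --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/my-project/global/images/my-resources-image 200" \
  --command 'ls "${RESOURCES}"' \
  --wait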
Let us know how this goes.
Thanks. I just resumed working on this after the break. I just want to add that a key factor is cost: the same job will run for thousands of data points, and each will need to read the same 20TB. This means a 20TB PD will be created for every job, which will limit the number of jobs running in parallel due to the PD quota and also add significant cost to the whole process.
I came across this issue when I did a little searching. I'm trying to run AlphaFold with dsub, but the 3TB database disk increases the cost by quite a lot. I'm wondering whether mounting a disk read-only and sharing it across multiple instances might be supported in the near future. My dsub command right now looks like this:
dsub --provider google-cls-v2 \
--project ${PROJECT_ID} \
--logging gs://$BUCKET/logs \
--image=$IMAGE \
--script=alphafold.sh \
--mount DB="${IMAGE_URL} 3000" \
--machine-type n1-standard-16 \
--boot-disk-size 100 \
--subnetwork ${SUBNET_NAME} \
--accelerator-type nvidia-tesla-k80 \
--accelerator-count 2 \
--preemptible \
--zones ${ZONE_NAMES} \
--tasks batch_tasks.tsv 9
I tried putting a disk URI, https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/disks/my-disk, into IMAGE_URL, but it didn't work. The log said it is not a supported resource.
And speed matters, as it did for @hnawar, so mounting a bucket won't be an option.
Hi @michaeleekk!
The way to do this (as supported by the Life Sciences API) is to create a disk with the resource file(s) and then create a GCE Image from that disk.
In dsub, you can then have a new disk created from the Image as described here:
https://github.com/DataBiosphere/dsub#mounting-resource-data
Creating Disks from Images should be much faster than pulling all of the data from GCS a file at a time. Please give this a try and let us know how it goes.
@mbookman Thanks for the reply.
I tried that method, but that way each instance will have an individual 3TB disk attached, and the cost of each run becomes expensive.
That's why I was asking whether there is a way to share a disk between multiple instances, as mentioned on this page.
That is supported by the Life Sciences API by using a Volume (with an ExistingDisk) instead of a Disk:
https://cloud.google.com/life-sciences/docs/reference/rpc/google.cloud.lifesciences.v2beta?hl=en#google.cloud.lifesciences.v2beta.ExistingDisk
But it is not supported by dsub at the moment.
I'm not very familiar with the code, but it's worth evaluating how much effort is needed to switch.
I just had a peek and, I might be wrong, but adding another mount type around here, like an ExistingDiskMountParam, plus handling the new class and parsing the URI more properly, might work. dsub seems to be a wrapper around the Life Sciences API, so it's theoretically workable, I guess.
Thanks for the pointer, @hnawar! I had not seen the ExistingDisk support added (late 2020).
We'll look at extending the --mount flag to take advantage of the capability.
Release https://github.com/DataBiosphere/dsub/releases/tag/v0.4.7 adds support for the ExistingDisk by extending the URL formats recognized with the --mount flag.
Here's the change: https://github.com/DataBiosphere/dsub/commit/2d0b808def65bc6100e4da81d9f82e241bbfb8c9
Please take a look and let us know how it goes for you. This has (obviously) been a long-standing feature request, and we're pretty interested to hear how much computational time you find it saves you.
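As a rough sketch (project, zone, disk, and bucket names here are hypothetical; see the commit above for the exact URL formats recognized), mounting an existing disk would look something like:
# Sketch only: the existing PD is attached read-only, so concurrent VMs can
# share it; its mount path is exposed via the ${DB} environment variable.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones us-west1-b \
  --logging gs://my-bucket/logs \
  --image ubuntu:20.04 \
  --mount DB="https://www.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/disks/my-shared-db-disk" \
  --command 'ls "${DB}"' \
  --wait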