nextflow / nextflow
Specify Google Cloud Compute Engine disk type
New feature
Ability to specify the Compute Engine disk type (pd-standard or local-SSD) found in the new Cloud Life Sciences API (https://cloud.google.com/life-sciences/docs/reference/rpc/google.cloud.lifesciences.v2beta#disk).
Usage scenario
Jobs that require high input/output operations per second (IOPS) and lower latency (https://cloud.google.com/compute/docs/disks/local-ssd).
Suggested implementation
The API documentation states the disk type can be set using setType()
(https://developers.google.com/resources/api-libraries/documentation/genomics/v1alpha2/java/latest/com/google/api/services/genomics/model/Disk.html#setType-java.lang.String-)
Add the disk type during formation of the VM in GoogleLifeSciencesHelper.groovy:
    protected Resources createResources(GoogleLifeSciencesSubmitRequest req) {
        def disk = new Disk()
        disk.setName(req.diskName)
        disk.setSizeGb(req.diskSizeGb)
        disk.setType(req.diskType)
where req.diskType is specified in GoogleLifeSciencesTaskHandler.groovy:
    req.bootDiskSizeGb = executor.config.bootDiskSize?.toGiga() as Integer
    req.diskType = task.config.getDiskType() as String
    return req
getDiskType() can be defined in TaskConfig.groovy, falling back to pd-standard by default:
    String getDiskType() {
        def value = get('diskType')
        if( !value )
            return 'pd-standard'
        if( value.toString() == 'pd-standard' || value.toString() == 'local-ssd' )
            return value.toString()
        return 'pd-standard'
    }
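Under this proposal a pipeline author could then choose the disk type per process, for example in nextflow.config. This is a hypothetical sketch: the diskType option name and its accepted values are assumptions taken from the getDiskType() code above, not an existing Nextflow setting.

```groovy
// nextflow.config -- hypothetical, assuming the proposed 'diskType' task option
process {
    // default: standard persistent disk, matching getDiskType()'s fallback
    diskType = 'pd-standard'

    // I/O-heavy tasks get a local SSD for higher IOPS and lower latency
    withName: 'fasterq_dump' {
        diskType = 'local-ssd'
    }
}
```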
Preliminary tests showed that a Compute Engine instance with a local SSD attached could be created successfully.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I would definitely support this. The key logic of Nextflow is somewhat challenged on the cloud: unless one has a shared disk that can be mounted by all task VMs, each task copies files back and forth to/from the bucket instead of using symlinks as on-prem. This behaviour hugely multiplies costs by increasing both I/O and runtime. The possibility of specifying the disk type could change the IOPS of the VMs and improve performance on worker VMs. This feature would help optimize Nextflow pipelines on the cloud. Quite important :)
Bump
We should be able to support this feature for both Google Life Sciences and Google Batch. I think the best way to support it in Nextflow would be to add a DiskResource class so that the disk type can be specified in the disk directive, like with accelerator. I have laid the groundwork for this in #3027, so when we merge that PR then I can implement the disk type.
+1, would be very useful for tasks like fasterq-dump
Google Batch does support SSD disks when using the Fusion file system. See here:
https://www.nextflow.io/docs/edge/google.html#fusion-file-system
Support for disk type was added to Google Batch in #3861. We aren't adding new features to the google-lifesciences executor because we encourage users to migrate to Google Batch, so I'm going to close this issue.
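For anyone landing here later: with Google Batch the type is passed through the disk directive. Per the current docs it looks something like the sketch below; the process name and script are made-up examples, so check the Nextflow documentation for your version for the exact form.

```groovy
process ALIGN {
    // request a local SSD as the task's scratch disk (Google Batch)
    disk 375.GB, type: 'local-ssd'

    script:
    """
    bwa mem ref.fa reads.fq > out.sam
    """
}
```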