nextflow / nextflow
Specify Google Cloud Compute Engine disk type
New feature
Ability to specify the Compute Engine disk type (pd-standard or local-SSD) found in the new Cloud Life Sciences API (https://cloud.google.com/life-sciences/docs/reference/rpc/google.cloud.lifesciences.v2beta#disk).
Usage scenario
Jobs that require high input/output operations per second (IOPS) and lower latency (https://cloud.google.com/compute/docs/disks/local-ssd).
Suggested implementation
The API documentation states the disk type can be set using setType()
(https://developers.google.com/resources/api-libraries/documentation/genomics/v1alpha2/java/latest/com/google/api/services/genomics/model/Disk.html#setType-java.lang.String-)
Add the disk type during formation of the VM in GoogleLifeSciencesHelper.groovy:
    protected Resources createResources(GoogleLifeSciencesSubmitRequest req) {
        def disk = new Disk()
        disk.setName(req.diskName)
        disk.setSizeGb(req.diskSizeGb)
        disk.setType(req.diskType)
where req.diskType is specified in GoogleLifeSciencesTaskHandler.groovy:
    req.bootDiskSizeGb = executor.config.bootDiskSize?.toGiga() as Integer
    req.diskType = task.config.getDiskType() as String
    return req
getDiskType() can be defined in TaskConfig.groovy, falling back to pd-standard by default:
    String getDiskType() {
        def value = get('diskType')
        if( !value )
            return 'pd-standard'
        if( value.toString() == 'pd-standard' || value.toString() == 'local-ssd' )
            return value.toString()
        return 'pd-standard'
    }
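Under this proposal a pipeline author could then choose the disk type per process, for example in nextflow.config. This is a hypothetical sketch: the diskType option name and its accepted values are assumptions taken from the getDiskType() code above, not an existing Nextflow setting.

```groovy
// nextflow.config -- hypothetical, assuming the proposed 'diskType' task option
process {
    // default: standard persistent disk, matching getDiskType()'s fallback
    diskType = 'pd-standard'

    // I/O-heavy tasks get a local SSD for higher IOPS and lower latency
    withName: 'fasterq_dump' {
        diskType = 'local-ssd'
    }
}
```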
Preliminary tests showed that a Compute Engine instance with a local SSD attached could be created successfully.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I would definitely support this. The key logic of Nextflow is somewhat challenged on the cloud: unless one has a shared disk that can be mounted by all task VMs, each task copies files back and forth to/from the bucket instead of using symlinks as on-prem. This behaviour hugely multiplies costs by increasing both I/O and runtime. The possibility of specifying the disk type could change the IOPS of the VMs and improve performance on worker VMs. This feature would help optimize Nextflow pipelines on the cloud. Quite important :)
Bump
We should be able to support this feature for both Google Life Sciences and Google Batch. I think the best way to support it in Nextflow would be to add a DiskResource class so that the disk type can be specified in the disk directive, like with accelerator. I have laid the groundwork for this in #3027, so when we merge that PR then I can implement the disk type.
+1, would be very useful for tasks like fasterq-dump
Google Batch does support SSD disks when using the Fusion file system. See here:
https://www.nextflow.io/docs/edge/google.html#fusion-file-system
Support for disk type was added to Google Batch in #3861. We aren't adding new features to the google-lifesciences executor because we encourage users to migrate to Google Batch, so I'm going to close this issue.
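For anyone landing here later: with Google Batch the type is passed through the disk directive. Per the current docs it looks something like the sketch below; the process name and script are made-up examples, so check the Nextflow documentation for your version for the exact form.

```groovy
process ALIGN {
    // request a local SSD as the task's scratch disk (Google Batch)
    disk 375.GB, type: 'local-ssd'

    script:
    """
    bwa mem ref.fa reads.fq > out.sam
    """
}
```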