Support for executing on F1 instances, specifically to leverage DRAGEN

Open nh13 opened this issue 2 years ago • 5 comments

I know I can specify the instance types for a specific context, which are then available to all tasks, but I would also like specific tasks to execute on a specific instance type, for example, using F1 instances to run DRAGEN. I'd like to understand, for each engine, if and how agc can help route tasks that require a specific instance type or family within a workflow.

For Nextflow, we have a machineType that can be specified in the config. Is this, or can this be, leveraged by agc?

For WDL (Cromwell and miniwdl) and Snakemake, how do we specify that a specific task/rule needs a specific instance type (GPU/FPGA)?

nh13 avatar Jun 10 '22 21:06 nh13

While this is possible in theory, there are a number of things that would need to be developed to make it work.

Start with the engines themselves. All of them are currently adapted to interact with AWS through Batch and Batch jobs. AWS Batch selects instances based on a ResourceRequirements definition, which currently doesn't expose a way to request FPGA resources (although this might change in the future). If it did change, the engines' AWS interaction would need to change to make use of it. Without this, there is no way for Batch to know that you want to place a particular Batch job on an f1 instance. If you made a compute environment that included f1 instances, you would also be running non-FPGA tasks on those instances, which would not be cost-effective. The EC2 launch template or AMI for the compute environment would also need to include the relevant FPGA libraries and drivers.
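To make the limitation concrete, here is a minimal sketch of the resource requirements a Batch job can carry. The three types listed are the ones the Batch API accepts as far as I know at the time of writing; the helper name `build_resource_requirements` and its `fpgas` argument are hypothetical and exist only to show where a request for an FPGA would have nowhere to go.

```python
# Resource types that an AWS Batch ResourceRequirements entry can name
# (as understood at the time of this thread). There is no FPGA entry.
SUPPORTED_TYPES = ("VCPU", "MEMORY", "GPU")


def build_resource_requirements(vcpus, memory_mib, gpus=0, fpgas=0):
    """Hypothetical helper: build a ResourceRequirements list for a Batch job.

    Batch expects each value as a string; MEMORY is expressed in MiB.
    """
    if fpgas:
        # Nothing in SUPPORTED_TYPES can express this request.
        raise ValueError("AWS Batch has no FPGA resource type to request")
    requirements = [
        {"type": "VCPU", "value": str(vcpus)},
        {"type": "MEMORY", "value": str(memory_mib)},
    ]
    if gpus:
        requirements.append({"type": "GPU", "value": str(gpus)})
    return requirements
```

A GPU task can be expressed this way, which is why Batch can place it on a GPU instance; an FPGA task cannot.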

Another possible complication with container jobs on FPGA instances is that some FPGAs are stateful, should not be shared between containers, and should be reset between jobs. I'm not sure if f1 instances fit into this category, but if they do, additional special job placement logic would be required.

An alternative to Batch could be AWS ParallelCluster, which can deploy and manage environments that include FPGAs. Currently AGC doesn't have the logic required to deploy and interact with a ParallelCluster environment, although this could be added. Some modification of the engines may also be required, although some do have SLURM support, so they might "just work".

I don't think WDL has any formal mechanism to request FPGAs, although presumably one could be added. It might require a new version, though, such as WDL 1.2. I am unsure about Snakemake.

markjschreiber avatar Jun 13 '22 13:06 markjschreiber

@markjschreiber Thank you for the thoughtful response. I have used ParallelCluster in the past, and what I am trying to avoid is having my clients set up and manage that infrastructure.

What if we could define multiple compute environments, such that the F1 instances could all be in one compute environment? In the simplest example, I have a single F1 instance type (e.g. 16 cores) in that environment, and then I specify that the rule/task/job uses all CPUs, in effect reserving the whole instance. I already do this when I set up my own Batch compute environment. I set up compute environments quite frequently for my clients, many of whom are not able to manage such environments, which is why AGC is so attractive for them compared with my own CloudFormation that I then have to manage.

nh13 avatar Jun 13 '22 15:06 nh13

Yes, that could work, although you would need some mechanism to indicate the ID of the job queue in the workflow task, and the workflow engine would need to be able to interpret that and submit the Batch job to the right queue. You'd probably want the workflow engine to ensure that memory and CPU requirements are set to the correct values so that you don't get multiple containers per f1 instance.
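As a rough sketch of what the engine side might do, assuming a dedicated queue for the f1 compute environment already exists: the function below builds the arguments for a Batch `submit_job` call that targets that queue and requests every vCPU on the instance, so only one container fits. The function name, the queue and job definition names, the instance figures, and the 2048 MiB of memory held back for the host are all illustrative assumptions, not anything AGC provides today.

```python
def whole_instance_job(name, job_definition, job_queue,
                       instance_vcpus, instance_memory_mib,
                       reserved_mib=2048):
    """Hypothetical helper: build submit_job arguments that claim a whole instance.

    Requesting all vCPUs means no second container fits on the instance.
    Some memory is left unrequested on the assumption that the host needs
    headroom; the right amount depends on the AMI and instance type.
    """
    return {
        "jobName": name,
        "jobQueue": job_queue,  # the queue backed by the f1 compute environment
        "jobDefinition": job_definition,
        "containerOverrides": {
            "resourceRequirements": [
                {"type": "VCPU", "value": str(instance_vcpus)},
                {"type": "MEMORY",
                 "value": str(instance_memory_mib - reserved_mib)},
            ]
        },
    }


# Example with made-up names and a 16-vCPU instance, as in the comment above.
job_kwargs = whole_instance_job(
    name="dragen-align",
    job_definition="dragen-jobdef",
    job_queue="f1-queue",
    instance_vcpus=16,
    instance_memory_mib=244 * 1024,
)
```

The resulting dictionary could be passed to boto3 as `batch_client.submit_job(**job_kwargs)`; the open question in this thread is how a workflow task would tell the engine which queue to use.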

It might also be worth increasing the number of AZs that AGC uses so that Batch has the largest pool to draw f1 instances from (as they can be oversubscribed at times of high demand). If AGC constructs a VPC, it will default to using 3 AZs (this is currently hard-coded but simple to change). If you supply your own VPC, you can have more.

markjschreiber avatar Jun 13 '22 16:06 markjschreiber

Check this alternative: https://aws-quickstart.github.io/quickstart-illumina-dragen/

hmkim87 avatar Jun 22 '22 05:06 hmkim87

Thanks @hmkim87, I have this working in Nextflow and Snakemake, but not through AGC. The purpose of my request is not to find “a” solution, but to request support in AGC. My apologies for the confusion.

nh13 avatar Jun 22 '22 14:06 nh13