aws-eda-slurm-cluster
AWS Slurm Cluster for EDA Workloads
**Is your feature request related to a problem? Please describe.** Currently, users specify core and memory requirements for jobs so that Slurm can pick the best compute node instance type for...
**Is your feature request related to a problem? Please describe.** When the compute node AMI needs to be updated, what effect does that have on running jobs? Can it be...
**Is your feature request related to a problem? Please describe.** When multiple AZs are configured, make sure that AZ-specific queues are created.
**Is your feature request related to a problem? Please describe.** When errors occur in head node or compute node custom action scripts, the configured SNS topic should be notified like...
**Is your feature request related to a problem? Please describe.** This is an example of a node definition from ParallelCluster:
```
NodeName=od-16-gb-dy-od-16gb-1-cores-[1-1000] CPUs=1 RealMemory=15564 State=CLOUD Feature=dynamic,od-16gb-1-cores Weight=1363
NodeName=od-128-gb-dy-od-128gb-2-cores-[1-1000] CPUs=2 RealMemory=124518...
```
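
To make the fields of the node definition above easier to inspect, here is a minimal Python sketch that splits one such line into its `Key=Value` pairs and reads the trailing `[1-1000]` range. The helper names `parse_node_line` and `expand_node_range` are hypothetical and not part of Slurm or ParallelCluster. The range helper assumes only the simple single-range form shown in the example; Slurm's hostlist syntax is more general than this.

```python
import re


def parse_node_line(line):
    """Split a slurm.conf-style node line into a dict of Key=Value fields."""
    fields = {}
    for token in line.split():
        # Partition on the first "=" only, so values may contain commas or dashes.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def expand_node_range(node_name):
    """Return (prefix, first, last) for a name ending in a simple [a-b] range.

    Assumption: only one trailing numeric range is handled; other hostlist
    forms are returned unchanged with None bounds.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", node_name)
    if match is None:
        return node_name, None, None
    return match.group(1), int(match.group(2)), int(match.group(3))


# First node definition from the example above.
line = (
    "NodeName=od-16-gb-dy-od-16gb-1-cores-[1-1000] CPUs=1 RealMemory=15564 "
    "State=CLOUD Feature=dynamic,od-16gb-1-cores Weight=1363"
)

node = parse_node_line(line)
prefix, first, last = expand_node_range(node["NodeName"])
features = node["Feature"].split(",")
```

With this, `node["RealMemory"]` and `node["Weight"]` are available as strings, and `features` lists the constraints a job could request against that node set.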
**Is your feature request related to a problem? Please describe.** Currently ParallelCluster only supports 50 compute resources and 50 queues. With memory based scheduling enabled you can only have 1...
**Is your feature request related to a problem? Please describe.** The ParallelCluster database stack currently uses static nodes instead of RDS serverless. Unclear if this will scale with cluster usage...
**Is your feature request related to a problem? Please describe.** The legacy version supported compute nodes in multiple AZs and regions. I don't think that orchestrating compute nodes in multiple...
**Is your feature request related to a problem? Please describe.** Slurm supports multiple controllers for HA. Add support for multiple controllers, each in a separate AZ. **Describe the solution you'd...
**Is your feature request related to a problem? Please describe.** Newly started ParallelCluster compute nodes take at least 4-5 minutes to boot and start. This should be reduced to little...