[Feature]: Enable all available network cards on AWS instances

Open un-def opened this issue 1 year ago • 2 comments

Problem

Currently, with the AWS backend, dstack unconditionally requests one network interface, even for instance types that have multiple network cards (e.g., p5.48xlarge has 32 EFA-capable cards). Network performance is crucial for distributed HPC workloads, so a single network interface may become a bottleneck.

Solution

Enable all available interfaces by default.

Workaround

No response

Would you like to help us implement this feature by sending a PR?

No

un-def avatar Oct 09 '24 12:10 un-def

Just random questions/thoughts:

  • I assume the preferred approach would be to have EFAs created/deleted alongside the associated node, correct?
  • AFAIK, only a single EFA can be attached upon EC2 creation, and the rest can be attached only by stopping the node. A more viable way to attach multiple EFAs right away could be via EC2 Launch Templates, but this might require re-working EC2 instance and fleet schemes. What do you think? Kinda makes sense for multi-node cluster placement groups.
  • Each EFA requires an IP address, and given that some instances take up to 32 EFAs, they can quickly exhaust the address pool. Thus, it might be a good idea to keep EFA attachments off by default and provide a way to enable exhaustive EFA attachment via the fleet config or by some other means.

timsolovev avatar Oct 09 '24 22:10 timsolovev

Sorry for the delayed reply.

I assume the preferred approach would be to have EFAs created/deleted alongside the associated node

Yes, ideally the lifetime of these resources should be bound to the lifetime of the parent resource.

AFAIK, only a single EFA can be attached upon EC2 creation

According to this snippet, it's possible to request multiple EFA interfaces via RunInstances, but it needs to be verified.

If that's true, the only limitation seems to be that we cannot use associatePublicIPAddress: true with multiple network interfaces via the RunInstances method.
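
A minimal sketch of what such a request could look like, assuming one EFA per network card. The helper name `build_efa_interfaces` is hypothetical (it is not part of dstack), and the device-index layout (index 0 on the primary card, index 1 on the others) is an assumption that should be checked against the AWS EFA documentation for the target instance type. No interface sets AssociatePublicIpAddress, in line with the limitation described above.

```python
from typing import Any


def build_efa_interfaces(
    subnet_id: str,
    security_group_ids: list[str],
    network_cards: int,
) -> list[dict[str, Any]]:
    """Build a NetworkInterfaces list for RunInstances, one EFA per card.

    Hypothetical helper for illustration only; not dstack code.
    """
    if network_cards < 1:
        raise ValueError("network_cards must be at least 1")

    interfaces = []
    for card_index in range(network_cards):
        interfaces.append(
            {
                "NetworkCardIndex": card_index,
                # Assumption: the primary interface sits at device index 0,
                # interfaces on the remaining cards at device index 1.
                "DeviceIndex": 0 if card_index == 0 else 1,
                "InterfaceType": "efa",
                "SubnetId": subnet_id,
                "Groups": list(security_group_ids),
                # Ties the interface lifetime to the instance lifetime.
                "DeleteOnTermination": True,
                # AssociatePublicIpAddress is deliberately left out.
            }
        )
    return interfaces
```

The resulting list would be passed as the `NetworkInterfaces` argument of boto3's `run_instances` call; whether the request is accepted for a given instance type still needs to be verified against a real account.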

can be attached only by stopping the node

Didn't know about this limitation, but you are right, we have to stop the instance first to attach interfaces.

A more viable way to attach multiple EFAs right away could be via EC2 Launch Templates

I'm not sure whether templates have the same limitation with associatePublicIPAddress; that needs to be checked. And, as you noted, migrating to templates would require additional rework. If it's possible to work around the associatePublicIPAddress limitation with the current AWSCompute.create_instance() implementation, I think that would be the preferred way, even if it's a bit hacky (i.e. create an instance → stop → attach interfaces → start).
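
The stop → attach → start part of that workaround could be sketched roughly as follows. This is only an illustration under assumptions: `attach_extra_efas` is a hypothetical name, `ec2` is expected to be a boto3 EC2 client (or anything with the same methods), the instance is assumed to already exist with its primary interface on card 0, and device index 1 on the additional cards is an assumption. Error handling and cleanup of the created interfaces are not covered.

```python
from typing import Any


def attach_extra_efas(
    ec2: Any,
    instance_id: str,
    subnet_id: str,
    security_group_ids: list[str],
    network_cards: int,
) -> list[str]:
    """Stop the instance, attach one EFA to each non-primary card, restart.

    Hypothetical helper for illustration only; not dstack code.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    eni_ids = []
    # Card 0 already carries the primary interface created with the instance.
    for card_index in range(1, network_cards):
        created = ec2.create_network_interface(
            SubnetId=subnet_id,
            Groups=security_group_ids,
            InterfaceType="efa",
        )
        eni_id = created["NetworkInterface"]["NetworkInterfaceId"]
        ec2.attach_network_interface(
            NetworkInterfaceId=eni_id,
            InstanceId=instance_id,
            NetworkCardIndex=card_index,
            DeviceIndex=1,  # assumption, see the note above
        )
        eni_ids.append(eni_id)

    ec2.start_instances(InstanceIds=[instance_id])
    return eni_ids
```

Interfaces created this way are separate resources, so a real implementation would also have to delete them (or mark the attachments for deletion on termination) to keep their lifetime bound to the instance.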

Each EFA requires an IP address

As far as I understand, only the primary interface needs a public IP address to make it possible to connect to the instance; all other interfaces are used only for node-to-node connectivity within a private network.

As for private IPs, according to the AWS docs, “The allowed IPv4 CIDR block size for a subnet is between a /28 netmask and /16 netmask” with “[d]efault subnets within a default VPC are assigned /20 netblocks within the VPC CIDR range.”
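
To put these subnet sizes in perspective, here is a rough back-of-the-envelope calculation. It assumes that every interface consumes one private IPv4 address and that AWS reserves five addresses in each subnet; the function names and the example CIDR blocks are made up for illustration.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Private IPv4 addresses left in a subnet after the AWS reservation."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def max_instances(cidr: str, interfaces_per_instance: int) -> int:
    """Instances that fit if each interface takes one private address."""
    return usable_ips(cidr) // interfaces_per_instance


if __name__ == "__main__":
    for cidr in ("10.0.0.0/28", "10.0.0.0/20", "10.0.0.0/16"):
        print(cidr, usable_ips(cidr), max_instances(cidr, 32))
```

Under these assumptions, a default /20 subnet has 4091 usable addresses, enough for 127 instances with 32 interfaces each, while a /28 subnet (11 usable addresses) cannot hold even one such instance.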

un-def avatar Oct 18 '24 11:10 un-def

Confirming what you've mentioned above:

  • The snippet is legit, and everything is fine and dandy as long as none of the interfaces has AssociatePublicIpAddress=true.
  • From my perspective, it is more straightforward to create the instance first, then the network interfaces, then the attachments. Elastic IP is not really sustainable, given the super tight default quota per account.
  • VPC subnet sizes are not always under our control, and sometimes we can be limited to a /24 or even /25 CIDR, which certainly limits the number of ENA+EFA interfaces that can be provisioned. There is no such cap on EFA-only interfaces, as they do not require IP addresses.

I'll create a PR with a possible approach to this.

timsolovev avatar Nov 13 '24 20:11 timsolovev

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Dec 14 '24 02:12 github-actions[bot]

#2270 enables the maximum number of EFA interfaces for non-public instances. Closing.

r4victor avatar Feb 10 '25 09:02 r4victor