eksctl
[Bug] Cannot create cluster with p5 and efa enabled.
What were you trying to accomplish?
Deploy a p5 cluster with efaEnabled: true
What happened?
Node group creation failed
Resource handler returned message: "The specified number '2' for the device index exceeds the maximum number of network interfaces supported by the network card at network card index 2. The maximum devices for this network card is 2. Specify a number from 0 to 1, and try again. (Service: Eks, Status Code: 400, Request ID: XXX)" (RequestToken: XXX, HandlerErrorCode: InvalidRequest)
How to reproduce it?
Cluster deployment template
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: p5-cluster
  version: "1.28"
  region: us-east-1
# Fully-managed nodegroups
managedNodeGroups:
  # Nodegroup for system pods
  - name: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
  - name: p5-odcr-vpc
    instanceType: p5.48xlarge
    instancePrefix: p5-odcr-vpc
    privateNetworking: true
    efaEnabled: true
    minSize: 0
    desiredCapacity: 8
    maxSize: 16
    volumeSize: 500
    subnets:
      - XXX
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        ebs: true
        fsx: true
iam:
  withOIDC: true
Logs
Anything else we need to know? I believe this is due to https://github.com/eksctl-io/eksctl/blob/b39bb66db84607b4e38339a0d22ea554a69150cc/pkg/cfn/builder/network_interfaces.go#L67
That line gives each network card a device index equal to its network card index. However, p5 uses device index 0 only on network card 0, and device index 1 on the remaining 31 network cards; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/p5-instances-started.html:
P5 instances deliver 3200 Gbps of networking bandwidth by using multiple EFA interfaces. P5 instances support 32 network cards. We recommend that you define a single EFA network interface per network card. To configure these interfaces at launch we recommend the following settings:
For network interface 0, specify device index 0
For network interfaces 1 through 31, specify device index 1
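To make the mismatch concrete, here is a minimal standalone sketch (not eksctl code) that builds both layouts and checks them against the per-card limit of 2 devices reported in the error message above. The type `attachment` and the functions `perCardIndex`, `p5Recommended` and `firstInvalid` are hypothetical names introduced only for this illustration.

```go
package main

import "fmt"

// attachment pairs a network card with the device index requested on it.
// Hypothetical type, used only for this illustration.
type attachment struct {
	NetworkCardIndex int
	DeviceIndex      int
}

// perCardIndex mirrors the loop at the linked line: card i gets device index i.
func perCardIndex(numEFAs int) []attachment {
	out := []attachment{{NetworkCardIndex: 0, DeviceIndex: 0}}
	for i := 1; i < numEFAs; i++ {
		out = append(out, attachment{NetworkCardIndex: i, DeviceIndex: i})
	}
	return out
}

// p5Recommended follows the AWS guidance quoted above:
// card 0 uses device index 0, cards 1..31 use device index 1.
func p5Recommended(numEFAs int) []attachment {
	out := []attachment{{NetworkCardIndex: 0, DeviceIndex: 0}}
	for i := 1; i < numEFAs; i++ {
		out = append(out, attachment{NetworkCardIndex: i, DeviceIndex: 1})
	}
	return out
}

// firstInvalid returns the first attachment whose device index is not in
// the range 0..maxDevicesPerCard-1, if there is one.
func firstInvalid(as []attachment, maxDevicesPerCard int) (attachment, bool) {
	for _, a := range as {
		if a.DeviceIndex >= maxDevicesPerCard {
			return a, true
		}
	}
	return attachment{}, false
}

func main() {
	// 32 network cards per the AWS docs; limit of 2 taken from the error message.
	const numEFAs, maxPerCard = 32, 2
	if a, bad := firstInvalid(perCardIndex(numEFAs), maxPerCard); bad {
		fmt.Printf("per-card layout rejected at network card %d, device index %d\n",
			a.NetworkCardIndex, a.DeviceIndex)
	}
	_, bad := firstInvalid(p5Recommended(numEFAs), maxPerCard)
	fmt.Println("recommended layout has an invalid attachment:", bad)
}
```

Under these assumptions the first rejected pair in the per-card layout is network card 2 with device index 2, which is the same pair named in the error message.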
Versions
eksctl version: 0.169.0-dev+1c8cc6244.2024-01-24T01:15:25Z
kubectl version: v1.27.2
OS: darwin
++
I can confirm that simply hardcoding the second argument to defaultNetworkInterface() as 1 creates the p5 cluster successfully:
diff --git a/pkg/cfn/builder/network_interfaces.go b/pkg/cfn/builder/network_interfaces.go
index 103811ce5..2da89dd23 100644
--- a/pkg/cfn/builder/network_interfaces.go
+++ b/pkg/cfn/builder/network_interfaces.go
@@ -64,7 +64,7 @@ func buildNetworkInterfaces(
 		// Due to ASG incompatibilities, we create each network card
 		// with its own device
 		for i := 1; i < int(numEFAs); i++ {
-			ni := defaultNetworkInterface(securityGroups, i, i)
+			ni := defaultNetworkInterface(securityGroups, 1, i)
 			ni.InterfaceType = gfnt.NewString("efa")
 			nis = append(nis, ni)
 		}
Obviously some extra logic would be required so this is only applied to p5 nodes.
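One possible shape for that extra logic is sketched below. This is only an illustration, not a proposed patch: `efaDeviceIndex` is a hypothetical helper that does not exist in eksctl, and matching on the "p5." prefix is an assumption about how p5 instance types would be detected. For all other instance types it returns the network card index, which is what the current loop passes.

```go
package main

import (
	"fmt"
	"strings"
)

// efaDeviceIndex returns the device index to request for the EFA interface
// on the given network card. Hypothetical helper, not part of eksctl.
func efaDeviceIndex(instanceType string, networkCardIndex int) int {
	if networkCardIndex == 0 {
		// The primary interface always uses device index 0.
		return 0
	}
	// Assumption: p5 types are identified by the "p5." prefix.
	if strings.HasPrefix(instanceType, "p5.") {
		// AWS guidance for p5: cards 1 through 31 use device index 1.
		return 1
	}
	// Unchanged behavior for other types: device index equals card index.
	return networkCardIndex
}

func main() {
	for _, it := range []string{"p5.48xlarge", "p4d.24xlarge"} {
		for _, card := range []int{0, 1, 2, 3} {
			fmt.Printf("%s card %d -> device index %d\n", it, card, efaDeviceIndex(it, card))
		}
	}
}
```

In the loop from the diff, the call would then look something like defaultNetworkInterface(securityGroups, efaDeviceIndex(instanceType, i), i), assuming a single instance type is available at that point.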