eksctl
Add/hpc7g node arm support
Description
Problem: The new hpc7g instances use the Graviton3E processor (arm) but are not detected as such by eksctl. In addition, as we have been discussing in https://github.com/weaveworks/eksctl/issues/6222, the daemon set for the efa device driver does not work with runAsNonRoot set to true. I believe this tweak is close, however the final step (I think) also needs to be providing an ARM build of the driver itself. I believe this is proprietary code, so I wanted to ask here first about that. I was able to figure out the container entrypoint and output, in case that helps:
$ /usr/bin/efa-k8s-device-plugin
2023/06/27 00:17:39 Fetching EFA devices.
2023/06/27 00:17:39 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6
2023/06/27 00:17:39 EFA Device list: [{rdmap0s6 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s6}]
2023/06/27 00:17:39 Starting FS watcher.
2023/06/27 00:17:39 Starting OS watcher.
2023/06/27 00:17:39 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6
2023/06/27 00:17:39 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
2023/06/27 00:17:39 Registered device plugin with Kubelet
Note that online examples for efa (e.g., this repository) are not exactly what we want - the Dockerfile will build a container that can use EFA but not one that has that particular executable.
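To make the detection problem concrete, here is a minimal sketch of prefix-style architecture detection by instance family. The function name `isARMInstanceType` and the `armFamilies` list are hypothetical and deliberately incomplete; this is not eksctl's actual implementation, only an illustration of why a new family such as hpc7g goes undetected until it is listed.

```go
package main

import (
	"fmt"
	"strings"
)

// armFamilies is an illustrative, non-exhaustive list of Graviton-based
// instance families. It is an assumption for this sketch, not eksctl's list.
var armFamilies = []string{"a1", "t4g", "m6g", "c6g", "r6g", "c7g", "hpc7g"}

// isARMInstanceType reports whether the family part of an instance type
// (the text before the first dot) appears in armFamilies.
func isARMInstanceType(instanceType string) bool {
	family, _, _ := strings.Cut(instanceType, ".")
	for _, f := range armFamilies {
		if family == f {
			return true
		}
	}
	return false
}

func main() {
	for _, t := range []string{"hpc7g.16xlarge", "hpc6a.48xlarge", "m6g.large"} {
		fmt.Printf("%s arm=%v\n", t, isARMInstanceType(t))
	}
}
```

With a lookup like this, removing `"hpc7g"` from the list makes `hpc7g.16xlarge` fall through to the non-ARM path, which matches the symptom described above.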
Let me know how you would like to proceed!
This is working now and can be reviewed. I changed nothing, and I have no idea why it's working. I think it might be related to the AWS metadata (describe-instances) that sets conditions for the efa / network devices. If it wasn't completed/ready on the first few tries, maybe that could lead to this outcome?
Why was this closed?
@vsoch, apologies again, it was closed by the stale bot. I have reopened it now and applied a label that should prevent it from being automatically closed. Please give us some time to review it; the team is occupied with other deliverables.
Thank you!
Please re-open again, thank you!
Hi @cPu1 could you please give feedback on the CI errors? I'm seeing them show up in other PRs, and it looks like an incorrect function signature is being used, for example:
GetOutpostInstanceTypes(context.backgroundCtx,*outposts.GetOutpostInstanceTypesInput,func(*outposts.Options))
0: context.backgroundCtx{emptyCtx:context.emptyCtx{}}
1: &outposts.GetOutpostInstanceTypesInput{OutpostId:(*string)(0xc0000542c0), MaxResults:(*int32)(nil), NextToken:(*string)(nil), noSmithyDocumentSerde:document.NoSerde{}}
2: (func(*outposts.Options))(0x7ec320)
The closest call I have is:
GetOutpostInstanceTypes(string,string)
0: "mock.Anything"
1: "mock.Anything"
and notably we don't touch the relevant code here. Is there another PR that is fixing these CI issues we should watch? Thanks!
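For anyone reading the failure above, a simplified model of why the expectation does not match may help. The sketch below is not testify's implementation; `matches` and `anything` are hypothetical names, and the string arguments stand in for the real context, input, and option values. It only shows that an expectation registered with two wildcard arguments cannot match a call made with three.

```go
package main

import (
	"fmt"
	"reflect"
)

// anything is a stand-in for a wildcard matcher such as mock.Anything.
const anything = "mock.Anything"

// matches reports whether an expectation matches an actual call in this
// simplified model: the argument counts must be equal, and each expected
// argument must be the wildcard or deeply equal to the actual argument.
func matches(expected, actual []any) bool {
	if len(expected) != len(actual) {
		return false
	}
	for i, e := range expected {
		if e == anything {
			continue
		}
		if !reflect.DeepEqual(e, actual[i]) {
			return false
		}
	}
	return true
}

func main() {
	// The actual call carries three arguments: context, input, one option func.
	call := []any{"ctx", "input", "optFn"}

	twoArgs := []any{anything, anything}
	threeArgs := []any{anything, anything, anything}

	fmt.Println(matches(twoArgs, call))   // false: arity differs
	fmt.Println(matches(threeArgs, call)) // true
}
```

If this model applies here, the fix would be on the mock expectation (adding a matcher for the trailing option function), not in the code this PR changes.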
@vsoch could you please rebase with main?
All set!