Add/hpc7g node arm support

Open vsoch opened this issue 2 years ago • 6 comments

Description

Problem: The new hpc7g instances use the Graviton3E processor (ARM) but are not detected as such by eksctl. In addition, as we have been discussing in https://github.com/weaveworks/eksctl/issues/6222, the daemon set for the EFA device driver does not work with runAsNonRoot set to true. I believe this tweak is close, but I think the final step is to also provide an ARM build of the driver itself. I believe that is proprietary code, so I wanted to ask here first. I was able to figure out the container entrypoint and output, in case that helps:

$ /usr/bin/efa-k8s-device-plugin 
2023/06/27 00:17:39 Fetching EFA devices.
2023/06/27 00:17:39 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6

2023/06/27 00:17:39 EFA Device list: [{rdmap0s6 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s6}]
2023/06/27 00:17:39 Starting FS watcher.
2023/06/27 00:17:39 Starting OS watcher.
2023/06/27 00:17:39 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6

2023/06/27 00:17:39 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
2023/06/27 00:17:39 Registered device plugin with Kubelet

Note that the online examples for EFA (e.g., this repository) are not exactly what we want - the Dockerfile will build a container that can use EFA, but not one that has that particular executable.
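To make the detection problem concrete, here is a minimal sketch of the kind of check involved, assuming the architecture is inferred from the instance family name. The function `isARMInstanceType` and the `armFamilies` list are hypothetical and deliberately incomplete; they are not eksctl's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// armFamilies is a hypothetical, non-exhaustive set of Graviton-based
// instance families. A real list would need every ARM family.
var armFamilies = map[string]bool{
	"a1":    true,
	"t4g":   true,
	"m6g":   true,
	"c7g":   true,
	"hpc7g": true, // the family this thread is about
}

// isARMInstanceType reports whether an instance type such as
// "hpc7g.16xlarge" belongs to one of the families listed above.
func isARMInstanceType(instanceType string) bool {
	family, _, found := strings.Cut(instanceType, ".")
	return found && armFamilies[family]
}

func main() {
	for _, it := range []string{"hpc7g.16xlarge", "hpc6a.48xlarge"} {
		fmt.Printf("%s arm=%t\n", it, isARMInstanceType(it))
	}
}
```

Matching on the whole family name, rather than on a shorter prefix, keeps an x86 family like hpc6a from being mistaken for an ARM one.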

Let me know how you would like to proceed!

vsoch avatar Jun 28 '23 18:06 vsoch

This is working now and can be reviewed. I changed nothing, and I have no idea why it's working. I think it might be related to the AWS metadata (describe-instances) that sets conditions for the EFA / network devices. If that metadata wasn't complete or ready on my first tries, maybe that could explain the earlier failures?

vsoch avatar Jul 11 '23 06:07 vsoch

Why was this closed?

vsoch avatar Dec 03 '23 01:12 vsoch

@vsoch, apologies again, it was closed by the stale bot. I have reopened it now and applied a label that should prevent it from being automatically closed. Please give us some time to review it; the team is occupied with other deliverables.

cPu1 avatar Dec 04 '23 06:12 cPu1

Thank you!

vsoch avatar Dec 04 '23 07:12 vsoch

Please re-open again, thank you!

vsoch avatar Jan 09 '24 02:01 vsoch

Hi @cPu1, could you please give feedback on the CI errors? I'm seeing them show up in other PRs, and it looks like an incorrect function signature is being used, for example:

  GetOutpostInstanceTypes(context.backgroundCtx,*outposts.GetOutpostInstanceTypesInput,func(*outposts.Options))
  		0: context.backgroundCtx{emptyCtx:context.emptyCtx{}}
  		1: &outposts.GetOutpostInstanceTypesInput{OutpostId:(*string)(0xc0000542c0), MaxResults:(*int32)(nil), NextToken:(*string)(nil), noSmithyDocumentSerde:document.NoSerde{}}
  		2: (func(*outposts.Options))(0x7ec320)

  The closest call I have is: 

  GetOutpostInstanceTypes(string,string)
  		0: "mock.Anything"
  		1: "mock.Anything"

and notably we don't touch the relevant code here. Is there another PR that is fixing these CI issues we should watch? Thanks!
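One plausible reading of the output above, offered as a guess rather than a diagnosis: the mocked method accepts variadic option functions, the mock counts each one as an extra argument, and the expectation was registered with only two matchers. The sketch below imitates that with a hand-rolled recorder instead of a real mocking library. `Input`, `Options`, `recorder`, and `callWithOneOption` are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Input and Options are hypothetical stand-ins for the SDK request and
// client-option types; they are not the real outposts types.
type Input struct{ OutpostID string }
type Options struct{ Region string }

// recorder is a hand-rolled stand-in for a mock. It only remembers how
// many arguments its expectation was registered with.
type recorder struct{ expectedArgs int }

// GetOutpostInstanceTypes counts ctx, in, and every variadic option
// function as one argument each, then compares with the expectation.
func (r *recorder) GetOutpostInstanceTypes(ctx context.Context, in *Input, optFns ...func(*Options)) error {
	got := 2 + len(optFns)
	if got != r.expectedArgs {
		return fmt.Errorf("expectation registered with %d arguments, call passed %d", r.expectedArgs, got)
	}
	return nil
}

// callWithOneOption calls the method the way a caller that passes a
// single option function would.
func callWithOneOption(r *recorder) error {
	return r.GetOutpostInstanceTypes(context.Background(), &Input{OutpostID: "op-example"}, func(*Options) {})
}

func main() {
	fmt.Println(callWithOneOption(&recorder{expectedArgs: 2})) // two-matcher expectation
	fmt.Println(callWithOneOption(&recorder{expectedArgs: 3})) // three-matcher expectation
}
```

If this reading is right, the fix would be on the test side (registering a third matcher for the option function), which would fit with this PR not touching the relevant code.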

vsoch avatar Jun 19 '24 21:06 vsoch

@vsoch could you please rebase with main?

cPu1 avatar Jul 04 '24 18:07 cPu1

All set!

vsoch avatar Jul 04 '24 19:07 vsoch