Boot hang after "random: crng init done" on worker node (ProLiant XL675d Gen10 Plus w/ AMD EPYC 7543)
What happened: The node hangs after "random: crng init done" on a worker node (ProLiant XL675d Gen10 Plus w/ AMD EPYC 7543). Screenshot
What you expected to happen: The node boots normally.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: Server: ProLiant XL675d Gen10 Plus w/ AMD EPYC 7543
Environment:
- EKS Anywhere Release: 0.10.1
- EKS Distro Release: 1.22-9
Hi @SeungyeopShin, what network drivers are present on this server? Are you using RAID? Any specific storage controllers/drivers? It would be useful if you are able to provide your hardware specifications. Thank you.
Hi @ptrivedi, the hardware information is below.
- Network
  - Marvell FastLinQ 41000 Series - 2P 10GbE 10GBASE-T QL41132HLRJ-HC MD2 Adapter - NIC
  - Intel(R) Ethernet Server Adapter I350-T4 (unused)
- RAID
  - Controller: HPE Smart Array P408i-a SR Gen10
  - OS: RAID 1
  - Data: Non-RAID
- GPU
  - Baseboard: NVIDIA HGX A100 8-GPU 80GB
Thanks.
Hey there @SeungyeopShin, we're unable to access the linked screenshot.
Also, if you could provide a sanitized copy of your cluster configuration and the support bundle generated by the EKS-A CLI, it would aid in debugging. Thank you!
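For reference, the bundle can usually be generated directly from the cluster spec with the CLI; a minimal sketch, assuming the spec lives in cluster.yaml:
# Generate a diagnostic support bundle from the cluster spec
# (the cluster.yaml file name is an assumption; point it at your own sanitized spec)
$ eksctl anywhere generate support-bundle -f cluster.yaml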
Hi @danbudris. The screenshot is here. The hang appears only on the GPU (HPE) server. My cluster configuration is below (xxx indicates masked values).
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: xxx-eksa
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: "10.255.xx.xx"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: xxx-eksa-cp
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: xxx-eksa
  kubernetesVersion: "1.22"
  managementCluster:
    name: xxx-eksa
  workerNodeGroupConfigurations:
    - count: 2
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: xxx-eksa-worker-default
      name: worker-default
    - count: 1
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: xxx-eksa-worker-gpu
      name: worker-gpu
  # IAM Authenticator
  identityProviderRefs:
    - kind: AWSIamConfig
      name: aws-iam-auth-config
  podIamConfig:
    serviceAccountIssuer: https://xxx.xxx.xxx
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: AWSIamConfig
metadata:
  name: aws-iam-auth-config
spec:
  awsRegion: "ap-northeast-2"
  backendMode:
    - "EKSConfigMap"
  mapRoles:
    - roleARN: xxx
      username: xxx
      groups:
        - system:masters
  partition: "aws"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: xxx-eksa
spec:
  tinkerbellIP: "10.255.xx.xx"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: xxx-eksa-cp
spec:
  hardwareSelector:
    type: cp
  osFamily: ubuntu
  templateRef: {}
  users:
    - name: xxx
      sshAuthorizedKeys:
        - ssh-rsa AAAA...
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: xxx-eksa-worker-default
spec:
  hardwareSelector:
    type: worker-default
  osFamily: ubuntu
  templateRef: {}
  users:
    - name: xxx
      sshAuthorizedKeys:
        - ssh-rsa AAAA...
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: xxx-eksa-worker-gpu
spec:
  hardwareSelector:
    type: worker-gpu
  osFamily: ubuntu
  templateRef: {}
  users:
    - name: xxx
      sshAuthorizedKeys:
        - ssh-rsa AAAA...
---
Can you confirm whether your other control plane and worker nodes were provisioned successfully and it is only the GPU node(s) that are stuck at that stage?
Another thing to mention is that AWS IAM Auth isn't supported for Bare Metal yet, but support will be added in the upcoming 0.11 release.
Hi @abhinavmpandey08. Yes, the non-GPU nodes were provisioned successfully and are working well. (The non-GPU nodes are not HPE servers; they are VMs on XCP-ng.)
I know that AWS IAM Auth isn't supported yet, so I manually set up IAM Auth and IAM for Pods, and that is working.
Hi @SeungyeopShin,
When attempting to get provisioned, these servers will first try to boot into the Hook OS. Hook is a minimal, in-memory, LinuxKit-based OS that acts as a bootstrapping OS to start the full provisioning of the servers. The Hook OS version that we vend and publish with EKS Anywhere GA does not include the Marvell/QLogic qed/qede drivers.
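If it helps to confirm which driver the NIC needs before rebuilding Hook, here is a minimal sketch, assuming you can boot any stock Linux (e.g. an Ubuntu live image) on the server:
# List NICs with their PCI IDs and the kernel driver currently in use
$ lspci -nnk | grep -iA3 ethernet
# For the QL41132 adapter, the expected driver is qede (backed by the qed core module)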
Here are the types of hardware on which we have validated EKS Anywhere.
For the wide variety of other hardware people may have, if Hook OS does not contain the necessary drivers, we recommend you attempt to build your own Hook kernel. We have documented the detailed steps to accomplish that here.
Once you have the kernel and initramfs built, you can host them at a location your cluster servers can access via http(s) and specify that custom location in the cluster config as described here.
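As a rough sketch of that hosting step (the artifact names and port here are assumptions; adjust them to whatever your build produces, and see the linked docs for the exact field name):
# Serve the custom Hook kernel and initramfs over plain HTTP
$ mkdir -p /var/www/hook
$ cp vmlinuz-x86_64 initramfs-x86_64 /var/www/hook/
$ python3 -m http.server 8080 --directory /var/www/hook
# Then point the TinkerbellDatacenterConfig at the file server, e.g.:
#   spec:
#     hookImagesURLPath: "http://<file-server-ip>:8080"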
Please let us know if you need further help. If you really get stuck, we can try and build the custom Hook OS for you.
Hi @ptrivedi, OK, I'll try to build my own Hook OS. Thanks.
Hi @ptrivedi
I tried to follow the guide you gave me and added the Marvell and QLogic drivers in menuconfig.
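For reference, the driver selections can be cross-checked in the generated kernel .config; a sketch, assuming the build uses the upstream qed/qede driver symbols:
# Confirm the QLogic drivers were compiled in as built-ins rather than modules
$ grep -E 'CONFIG_QEDE?=' .config
# Expect CONFIG_QED=y and CONFIG_QEDE=y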
When I boot using the customized Hook OS image, nothing happens after "Welcome to LinuxKit". Screenshot
The control plane workflow is stuck at the "Provisioning" phase:
$ kubectl get machines -A
NAMESPACE     NAME                                        CLUSTER      NODENAME   PROVIDERID   PHASE          AGE   VERSION
eksa-system   david3-k8s-bk7gc                            david3-k8s                           Provisioning   90s   v1.23.7-eks-1-23-4
eksa-system   david3-k8s-worker-default-f648f996b-ngr4b   david3-k8s                           Pending        92s   v1.23.7-eks-1-23-4
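For reference, the underlying Tinkerbell workflow objects can be inspected directly as well; a sketch, assuming the stock EKS Anywhere bare metal setup where the Tinkerbell CRDs live in the tinkerbell.org API group:
# List the provisioning workflows and inspect the one for the stuck machine
$ kubectl get workflows.tinkerbell.org -n eksa-system
$ kubectl describe workflows.tinkerbell.org <workflow-name> -n eksa-system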
When I followed the guide, one thing didn't work as documented:
- After step 6, the generated hook-kernel image tag is quay.io/tinkerbell/hook-kernel:5.10.85-b38f76a8ad6b24ceee1f5d1794a63d8233038707-dirty, so step 7 failed with an error. I then retagged the image as localhost:5000/hook-kernel:5.10.85 and pushed it to the Docker registry (sketched below).
- After pushing the image, the make dist step succeeded.
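For anyone hitting the same step-7 failure, the retagging workaround amounts to roughly the following (tag and registry values taken from the output above):
# Retag the -dirty kernel image for the local registry and push it
$ docker tag quay.io/tinkerbell/hook-kernel:5.10.85-b38f76a8ad6b24ceee1f5d1794a63d8233038707-dirty localhost:5000/hook-kernel:5.10.85
$ docker push localhost:5000/hook-kernel:5.10.85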
I have another question. As of v0.11, EKS-A no longer provides an Ubuntu image, so I tried building one with image-builder following https://anywhere.eks.amazonaws.com/docs/reference/artifacts/#build-bare-metal-node-images, but the Ubuntu image build failed:
$ image-builder build --os ubuntu --hypervisor baremetal --release-channel 1-23
...
+ export CNI_VERSION=
+ CNI_VERSION=
+ TMP_CNI=/tmp/eks-image-builder-cni
+ mkdir -p /tmp/eks-image-builder-cni
+ curl -o /tmp/eks-image-builder-cni/cni-plugins.tar.gz //cni-plugins-linux-amd64-.tar.gz
curl: (3) URL using bad/illegal format or missing URL
make: *** [Makefile:182: setup-packer-configs-raw] Error 3
make: Leaving directory '/tmp/eks-anywhere-build-tooling/projects/kubernetes-sigs/image-builder'
2022/09/16 08:56:15 Error executing image-builder for raw hypervisor: failed to run command: exit status 2
Do you have any ideas? Please let me know if you need more information. Thank you.
Hi @ptrivedi, I'm still waiting for your help. Please reply.... 😂
Hello @SeungyeopShin, can you share the full log output of the image-builder error? It looks like the manifest might not have been pulled correctly. Could you try again as the image-builder user, running the CLI from $HOME rather than /tmp or another directory? Please add --force to clean up previously failed builds:
image-builder build --os ubuntu --hypervisor baremetal --release-channel 1-23 --force
Hello @vignesh-goutham.
I succeeded in building the image with the --force option.
Thank you so much!!!!! 👍 👍👍👍👍👍👍👍👍👍
Great!