Boot hang after "random: crng init done" on worker node (ProLiant XL675d Gen10 Plus w/ AMD EPYC 7543)
What happened: The node hangs after "random: crng init done" on a worker node (ProLiant XL675d Gen10 Plus w/ AMD EPYC 7543). Screenshot
What you expected to happen: The node boots normally.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: Server: ProLiant XL675d Gen10 Plus w/ AMD EPYC 7543
Environment:
- EKS Anywhere Release: 0.10.1
- EKS Distro Release: 1.22-9
Hi @SeungyeopShin, what network drivers are present on this server? Are you using RAID? Any specific storage controllers/drivers? It would be useful if you are able to provide your hardware specifications. Thank you.
Hi @ptrivedi, the hardware information is below.
- Network
  - Marvell FastLinQ 41000 Series - 2P 10GbE 10GBASE-T QL41132HLRJ-HC MD2 Adapter - NIC
  - Intel(R) Ethernet Server Adapter I350-T4 (unused)
- RAID
  - Controller: HPE Smart Array P408i-a SR Gen10
  - OS: RAID 1
  - Data: Non-RAID
- GPU
  - Baseboard: NVIDIA HGX A100 8-GPU 80GB
Thanks.
Hey there @SeungyeopShin, we're unable to access the linked screenshot.
Also, if you could provide a sanitized copy of your cluster configuration and the support bundle generated by the EKS-A CLI, it would aid in debugging. Thank you!
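For reference, the bundle can usually be generated directly from the cluster spec with the CLI; a minimal sketch, assuming the spec lives in cluster.yaml:
# Generate a diagnostic support bundle from the cluster spec
# (the cluster.yaml file name is an assumption; point it at your own sanitized spec)
$ eksctl anywhere generate support-bundle -f cluster.yaml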
Hi @danbudris. The screenshot is here. The hang appears only on the GPU (HPE) server. My cluster configuration is below (xxx indicates masked values).
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: xxx-eksa
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: "10.255.xx.xx"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: xxx-eksa-cp
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: xxx-eksa
  kubernetesVersion: "1.22"
  managementCluster:
    name: xxx-eksa
  workerNodeGroupConfigurations:
    - count: 2
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: xxx-eksa-worker-default
      name: worker-default
    - count: 1
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: xxx-eksa-worker-gpu
      name: worker-gpu
  # IAM Authenticator
  identityProviderRefs:
    - kind: AWSIamConfig
      name: aws-iam-auth-config
  podIamConfig:
    serviceAccountIssuer: https://xxx.xxx.xxx
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: AWSIamConfig
metadata:
  name: aws-iam-auth-config
spec:
  awsRegion: "ap-northeast-2"
  backendMode:
    - "EKSConfigMap"
  mapRoles:
    - roleARN: xxx
      username: xxx
      groups:
        - system:masters
  partition: "aws"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: xxx-eksa
spec:
  tinkerbellIP: "10.255.xx.xx"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: xxx-eksa-cp
spec:
  hardwareSelector:
    type: cp
  osFamily: ubuntu
  templateRef: {}
  users:
    - name: xxx
      sshAuthorizedKeys:
        - ssh-rsa AAAA...
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: xxx-eksa-worker-default
spec:
  hardwareSelector:
    type: worker-default
  osFamily: ubuntu
  templateRef: {}
  users:
    - name: xxx
      sshAuthorizedKeys:
        - ssh-rsa AAAA...
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: xxx-eksa-worker-gpu
spec:
  hardwareSelector:
    type: worker-gpu
  osFamily: ubuntu
  templateRef: {}
  users:
    - name: xxx
      sshAuthorizedKeys:
        - ssh-rsa AAAA...
---
Can you confirm whether your other control plane and worker nodes were provisioned successfully and it is only the GPU node(s) that are stuck at that stage?
Another thing to mention is that AWS IAM Auth isn't supported for Bare Metal yet, but support will be added in the upcoming 0.11 release.
Hi @abhinavmpandey08. Yes, the non-GPU nodes were provisioned successfully and are working well. (The non-GPU nodes are not HPE servers; they are VMs on XCP-ng.)
I know that AWS IAM Auth isn't supported yet, so I manually set up IAM Auth and IAM for Pods, and that is working.
Hi @SeungyeopShin,
When attempting to get provisioned, these servers will first try to boot into the Hook OS. Hook is a minimal, in-memory, LinuxKit-based OS that acts as a bootstrapping OS to start the full provisioning of the servers. The Hook OS version that we vend and publish with EKS Anywhere GA does not include the Marvell/QLogic qed/qede drivers.
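If it helps to confirm which driver the NIC needs before rebuilding Hook, here is a minimal sketch, assuming you can boot any stock Linux (e.g. an Ubuntu live image) on the server:
# List NICs with their PCI IDs and the kernel driver currently in use
$ lspci -nnk | grep -iA3 ethernet
# For the QL41132 adapter, the expected driver is qede (backed by the qed core module)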
Here are the types of hardware on which we have validated EKS Anywhere.
For the wide variety of other hardware people may have, if Hook OS does not contain the necessary drivers, we recommend you attempt to build your own Hook kernel. We have documented the detailed steps to accomplish that here.
Once you have the kernel and initramfs built, you can host them at a location your cluster servers can access via http(s) and specify that custom location in the cluster config as described here.
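As a rough sketch of that hosting step (the artifact names and port here are assumptions; adjust them to whatever your build produces, and see the linked docs for the exact field name):
# Serve the custom Hook kernel and initramfs over plain HTTP
$ mkdir -p /var/www/hook
$ cp vmlinuz-x86_64 initramfs-x86_64 /var/www/hook/
$ python3 -m http.server 8080 --directory /var/www/hook
# Then point the TinkerbellDatacenterConfig at the file server, e.g.:
#   spec:
#     hookImagesURLPath: "http://<file-server-ip>:8080"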
Please let us know if you need further help. If you really get stuck, we can try and build the custom Hook OS for you.
Hi @ptrivedi, OK, I'll try to build my own Hook OS. Thanks.
Hi @ptrivedi
I tried to follow the guide you gave me and added the Marvell and QLogic drivers in menuconfig.
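For reference, the driver selections can be cross-checked in the generated kernel .config; a sketch, assuming the build uses the upstream qed/qede driver symbols:
# Confirm the QLogic drivers were compiled in as built-ins rather than modules
$ grep -E 'CONFIG_QEDE?=' .config
# Expect CONFIG_QED=y and CONFIG_QEDE=y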
When I boot using the customized Hook OS image, nothing happens after "Welcome to LinuxKit". Screenshot
The control plane workflow is stuck at the "Provisioning" phase:
$ kubectl get machines -A
NAMESPACE     NAME                                        CLUSTER      NODENAME   PROVIDERID   PHASE          AGE   VERSION
eksa-system   david3-k8s-bk7gc                            david3-k8s                           Provisioning   90s   v1.23.7-eks-1-23-4
eksa-system   david3-k8s-worker-default-f648f996b-ngr4b   david3-k8s                           Pending        92s   v1.23.7-eks-1-23-4
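For reference, the underlying Tinkerbell workflow objects can be inspected directly as well; a sketch, assuming the stock EKS Anywhere bare metal setup where the Tinkerbell CRDs live in the tinkerbell.org API group:
# List the provisioning workflows and inspect the one for the stuck machine
$ kubectl get workflows.tinkerbell.org -n eksa-system
$ kubectl describe workflows.tinkerbell.org <workflow-name> -n eksa-system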
When I followed the guide, one thing didn't work as documented:
- After step 6, the generated hook-kernel image tag is quay.io/tinkerbell/hook-kernel:5.10.85-b38f76a8ad6b24ceee1f5d1794a63d8233038707-dirty, so step 7 failed with an error. I then retagged the image as localhost:5000/hook-kernel:5.10.85 and pushed it to the Docker registry (sketched below).
- After pushing the image, the make dist step succeeded.
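For anyone hitting the same step-7 failure, the retagging workaround amounts to roughly the following (tag and registry values taken from the output above):
# Retag the -dirty kernel image for the local registry and push it
$ docker tag quay.io/tinkerbell/hook-kernel:5.10.85-b38f76a8ad6b24ceee1f5d1794a63d8233038707-dirty localhost:5000/hook-kernel:5.10.85
$ docker push localhost:5000/hook-kernel:5.10.85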
I have another question. As of v0.11, EKS-A no longer provides an Ubuntu image, so I tried building one with image-builder following https://anywhere.eks.amazonaws.com/docs/reference/artifacts/#build-bare-metal-node-images, but the Ubuntu image build failed:
$ image-builder build --os ubuntu --hypervisor baremetal --release-channel 1-23
...
+ export CNI_VERSION=
+ CNI_VERSION=
+ TMP_CNI=/tmp/eks-image-builder-cni
+ mkdir -p /tmp/eks-image-builder-cni
+ curl -o /tmp/eks-image-builder-cni/cni-plugins.tar.gz //cni-plugins-linux-amd64-.tar.gz
curl: (3) URL using bad/illegal format or missing URL
make: *** [Makefile:182: setup-packer-configs-raw] Error 3
make: Leaving directory '/tmp/eks-anywhere-build-tooling/projects/kubernetes-sigs/image-builder'
2022/09/16 08:56:15 Error executing image-builder for raw hypervisor: failed to run command: exit status 2
Do you have any ideas? Please let me know if you need more information. Thank you.
Hi @ptrivedi, I'm still waiting for your help. Please reply.... 😂
Hello @SeungyeopShin, can you share the full log output of the image-builder error? It looks like the manifest might not have been pulled correctly. Could you try again as the image-builder user, running the CLI from $HOME rather than /tmp or another directory? Please add --force to clean up previously failed builds:
image-builder build --os ubuntu --hypervisor baremetal --release-channel 1-23 --force
Hello @vignesh-goutham.
I succeeded in building the image with the --force option.
Thank you so much!!!!! 👍 👍👍👍👍👍👍👍👍👍
Great!