[AWS EBS] NVMe sometimes fails to attach
Issue Report
Container Linux Version
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1632.3.0
VERSION_ID=1632.3.0
BUILD_ID=2018-02-14-0338
PRETTY_NAME="Container Linux by CoreOS 1632.3.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Environment
- AWS EC2 m5.xlarge instance
- Encrypted EBS gp2 volume attached as /dev/xvdba
Expected Behavior
Every AWS EBS NVMe volume attached to an instance is brought up by the kernel and becomes available on the node.
Actual Behavior
Some volumes fail to be brought up by the kernel:
[ 2753.800168] pci 0000:00:1e.0: [1d0f:8061] type 00 class 0x010802
[ 2753.800270] pci 0000:00:1e.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 2753.801205] pci 0000:00:1e.0: BAR 0: assigned [mem 0xc0004000-0xc0007fff]
[ 2753.803879] nvme nvme2: pci function 0000:00:1e.0
[ 2753.805987] nvme 0000:00:1e.0: enabling device (0000 -> 0002)
[ 2753.919589] nvme nvme2: failed to mark controller live
[ 2753.920704] nvme nvme2: Removing after probe failure status: 0
Some similar volumes actually attached successfully:
$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 vol0ff923da7ac79719f Amazon Elastic Block Store 1 0.00 B / 107.37 GB 512 B + 0 B 1.0
/dev/nvme1n1 vol031fc070a0ea9b7b7 Amazon Elastic Block Store 1 0.00 B / 42.95 GB 512 B + 0 B 1.0
$ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid : 0x1d0f
ssvid : 0x1d0f
sn : vol0ff923da7ac79719f
mn : Amazon Elastic Block Store
fr : 1.0
rab : 32
ieee : dc02a0
cmic : 0
mdts : 6
cntlid : 0
ver : 0
rtd3r : 0
rtd3e : 0
oaes : 0
oacs : 0
acl : 4
aerl : 0
frmw : 0x3
lpa : 0
elpe : 0
npss : 1
avscc : 0x1
apsta : 0
wctemp : 0
cctemp : 0
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0
fuses : 0
fna : 0
vwc : 0x1
awun : 0
awupf : 0
nvscc : 0
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:0.01W operational enlat:1000000 exlat:1000000 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:0.00W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
I am not entirely sure what exact situation reproduces the issue. It might be (or have been) a problem on AWS's side.
I spawned and attached ~500 volumes tonight, on both 1632.2.1 and 1632.3.0, and haven't reproduced it. I only saw a few volumes stuck in the attaching state (impaired volumes), but those never even reach the instance/OS. However, over the past two days (including today) I've seen a few volumes that could not be recognized on certain nodes until I killed the node and got a new one in the AZ.
I will update if I can reproduce it again.
We have seen reports of several hiccups specific to m5 instances (lockups, long pauses): https://github.com/coreos/bugs/issues/2326.
Is this NVMe attach failure specific to this kind of instance? If so, you may want to check with AWS support to see if they already have a ticket related to this.
For now I'll assume this was a temporary AWS issue. If you can reproduce it again, please do reopen with additional details!
Having the same issue with 1632.3.0 on m5.large
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1632.3.0
VERSION_ID=1632.3.0
BUILD_ID=2018-02-14-0338
PRETTY_NAME="Container Linux by CoreOS 1632.3.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Running Kubernetes 1.9.6, and it fails to attach one of the EBS volumes.
dmesg
[Mon Apr 16 11:35:24 2018] nvme nvme2: pci function 0000:00:1c.0
[Mon Apr 16 11:35:24 2018] nvme nvme2: failed to mark controller live
[Mon Apr 16 11:35:24 2018] nvme nvme2: Removing after probe failure status: 0
[Mon Apr 16 11:38:24 2018] pci 0000:00:1c.0: [1d0f:8061] type 00 class 0x010802
[Mon Apr 16 11:38:24 2018] pci 0000:00:1c.0: reg 0x10: [mem 0xc0004000-0xc0007fff]
[Mon Apr 16 11:38:24 2018] pci 0000:00:1c.0: BAR 0: assigned [mem 0xc0004000-0xc0007fff]
[Mon Apr 16 11:38:24 2018] nvme nvme2: pci function 0000:00:1c.0
[Mon Apr 16 11:38:24 2018] nvme nvme2: failed to mark controller live
[Mon Apr 16 11:38:24 2018] nvme nvme2: Removing after probe failure status: 0
kubelet
Apr 16 11:40:41 ip-10-113-49-225 kubelet[16829]: I0416 11:40:41.991739 16829 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "elasticsearch-002" (UniqueName: "kubernetes.io/aws-ebs/vol-074007bc7b169482f") pod "elasticsearch-2" (UID: "f7e842c3-4162-11e8-bb60-0ef36e0452b8")
Apr 16 11:40:41 ip-10-113-49-225 kubelet[16829]: E0416 11:40:41.995143 16829 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/vol-074007bc7b169482f\"" failed. No retries permitted until 2018-04-16 11:42:43.995110037 +0000 UTC m=+4568.955848612 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"elasticsearch-002\" (UniqueName: \"kubernetes.io/aws-ebs/vol-074007bc7b169482f\") pod \"elasticsearch-2\" (UID: \"f7e842c3-4162-11e8-bb60-0ef36e0452b8\") "
Apr 16 11:41:03 ip-10-113-49-225 kubelet[16829]: E0416 11:41:03.619703 16829 kubelet.go:1630] Unable to mount volumes for pod "elasticsearch-2_default(f7e842c3-4162-11e8-bb60-0ef36e0452b8)": timeout expired waiting for volumes to attach/mount for pod "default"/"elasticsearch-2". list of unattached/unmounted volumes=[datadir]; skipping pod
Apr 16 11:41:03 ip-10-113-49-225 kubelet[16829]: E0416 11:41:03.619753 16829 pod_workers.go:186] Error syncing pod f7e842c3-4162-11e8-bb60-0ef36e0452b8 ("elasticsearch-2_default(f7e842c3-4162-11e8-bb60-0ef36e0452b8)"), skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"elasticsearch-2". list of unattached/unmounted volumes=[datadir]
There are 3 more EBS volumes attached without any issues.
$ nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 vol03244c2fafa745b7f Amazon Elastic Block Store 1 0.00 B / 137.44 GB 512 B + 0 B 1.0
/dev/nvme1n1 vol0c1fa0d6da8aac5d4 Amazon Elastic Block Store 1 0.00 B / 21.47 GB 512 B + 0 B 1.0
/dev/nvme3n1 vol010573e38c819c12f Amazon Elastic Block Store 1 0.00 B / 53.69 GB 512 B + 0 B 1.0
/dev/nvme4n1 vol095a94583ea1a2102 Amazon Elastic Block Store 1 0.00 B / 53.69 GB 512 B + 0 B 1.0
I can confirm @juris's case, because today I hit the same issue:
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.5.3
VERSION_ID=1688.5.3
BUILD_ID=2018-04-03-0547
PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Apr 16 08:53:15 ip-172-31-15-226.ec2.internal kernel: pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
Apr 16 08:53:15 ip-172-31-15-226.ec2.internal kernel: pci 0000:00:1f.0: reg 0x10: [mem 0xc0000000-0xc0003fff]
Apr 16 08:53:15 ip-172-31-15-226.ec2.internal kernel: pci 0000:00:1f.0: BAR 0: assigned [mem 0xc0000000-0xc0003fff]
Apr 16 08:53:15 ip-172-31-15-226.ec2.internal kernel: nvme nvme1: pci function 0000:00:1f.0
Apr 16 08:53:15 ip-172-31-15-226.ec2.internal kernel: nvme nvme1: failed to mark controller live
Apr 16 08:53:15 ip-172-31-15-226.ec2.internal kernel: nvme nvme1: Removing after probe failure status: 0
This basically happens all the time on my Kubernetes nodes and completely screws over my Kubernetes clusters on C5s: while EC2 thinks the volume is properly attached, EC2 will not be able to detach it without the 'force' option. Therefore: 1/ the pod will not start on the node; 2/ after the pod gets rescheduled, Kubernetes will try to detach and re-attach the volume somewhere else, but the volume will get stuck in its 'detaching' state, and the pod won't ever start on any other node until someone either a/ force-detaches the volume or b/ drains and kills the node.
The volume will attach to other Container Linux nodes properly, provided that node is not borked as well. Note that in relatively small clusters, people usually have 1-3 nodes per AZ, so any affected node can seriously disrupt the high-availability guarantees of the cluster, as volumes are bound to specific AZs.
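For anyone stuck in that state, the manual recovery looks roughly like the sketch below (assuming a configured AWS CLI and kubectl; the node name, volume ID, and instance ID are placeholders, not values from this report):
# 1/ cordon and drain the affected node so the pod gets rescheduled elsewhere
$ kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-local-data
# 2/ once the volume is stuck in "detaching", force-detach it on the EC2 side;
#    this skips the clean unmount, which is tolerable here only because the guest
#    never brought the device up in the first place
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
# 3/ replace the borked node; the volume should then attach cleanly to another
#    node in the same AZ
$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0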
@Quentin-M when Kubernetes cannot detach the volume, is the volume still mounted on the node? It sounds like this could be a problem somewhere in the EBS stack rather than in Kubernetes itself. Have you talked to Amazon, and what do they say?
The volume never gets mounted, as shown in the nvme list above. The kernel fails to bring up the NVMe controller for the device. Kubernetes is not the culprit; the issue lies between the OS and AWS. It is AWS that fails to detach the volume without --force.
I have created a ticket with AWS; no answer so far. However, the volume can be mounted successfully on other machines.
Why do you have to force-detach the volume if it is not mounted on the node? A normal detach should do the job.
Can you post the output of ls /dev/disk/by-id/ from the node where the volume is being attached but not mounted? If the volume is attached but not mounted, it could be a problem with Kubernetes.
So it appears that, although EC2 thinks the volume is attached to the node, when you log in to the node you can't see the disk? The disk is not mounted and not even attached (it doesn't show up in lsblk, nvme list, or ls /dev/disk/by-id)? Is that correct?
If that is indeed the case, then the problem is somewhere within the Amazon stack and not kube.
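For reference, the checks in question can be run on the affected node along these lines (all standard tools; the exact device names will differ):
$ lsblk                                   # block devices the kernel actually brought up
$ ls -l /dev/disk/by-id/ | grep -i nvme   # udev links carrying the EBS volume IDs
$ sudo nvme list                          # controllers that completed initialization
$ dmesg | grep -i nvme                    # probe failures like the ones above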
@Quentin-M kernels 4.16 and 4.17 are getting a lot of NVMe-related stability/correctness patches (I'm specifically looking at this and this, but I'm not very familiar with NVMe), and it looks like you are hitting some kind of initialization race between the virtual controller and the Linux driver (where we don't really know which side is to blame).
The next alpha will ship with 4.16.2, which may (or may not) alleviate it. Could this race be related to multiple devices being initialized at the same time? Do you also experience the same initialization failure with a single device, or with multiple devices attached at separate times?
After a bit of back-and-forth with AWS:
The behaviour you have observed with relation to Force Detach is expected, as this results in the EBS volume being removed without first clearing any open handles from the kernel in your instance. This can be likened to detaching a SATA cable from a drive that is mounted to the system, without first unmounting it.
From the attached NVMe logs and bug report on GitHub, it does look like it's an issue with the NVMe support in your kernel, however we're still investigating on our side to rule out any issues with EBS.
We will be keeping this case open until the investigation has concluded, however if you have any updates from your end feel free to let us know through this case.
Could we please re-open this issue for the time being, as multiple users are affected, and as it is being investigated as a real issue?
@Quentin-M alpha 1758.0.0 is out with kernel 4.16.3; have you had a chance to try it yet?
Have you noticed any correlation with EBS volume size? I'm seeing a very similar issue, but one that prevents boot-up on instances with large root volumes. It works fine with a small root volume.
@lucab I switched my cluster to alpha and now periodically force non-critical pods that use EBS to reschedule on other nodes. So far there have been three successful reschedules with EBS attach/detach, which never succeeded on the stable branch.
We see similar behavior in NixOS when booting up an M5 instance with a big root volume (~200 GB): https://github.com/NixOS/nixpkgs/issues/39867
@lucab @Quentin-M it happens again on alpha with 4.16.7.
We have been having this issue since stable 1745.3.1 (kernel 4.14.42) and still have it with 1745.5.0 (kernel 4.14.44).
Can confirm this error on m5.2xlarge instances.
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.7.0
VERSION_ID=1745.7.0
BUILD_ID=2018-06-14-0909
PRETTY_NAME="Container Linux by CoreOS 1745.7.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Does anyone know if this is still an issue on 1800.3.0 or 1828.0.0? It is preventing Container Linux users from using m5/c5 machines reliably.
@Quentin-M I would suggest checking 1828.0.0, which is the first release shipping with 4.17. That should hopefully bring stability to the driver. If not, I would suggest reporting this directly to the kernel subsystem maintainers to track down the issue.
Has anyone tried tuning nvme_core.io_timeout and nvme_core.max_retries as recommended in the AWS docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html#timeout-nvme-ebs-volumes?
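One way to try it on Container Linux is via the kernel command line; assuming the stock GRUB setup, extra arguments can be added to linux_append in /usr/share/oem/grub.cfg. A sketch, using the maximum io_timeout value from the AWS doc (the max_retries value here is only an example, not a recommendation from the doc):
# add (or extend) the linux_append line in the OEM grub.cfg, then reboot
$ sudo vi /usr/share/oem/grub.cfg
    set linux_append="nvme_core.io_timeout=4294967295 nvme_core.max_retries=10"
$ sudo reboot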
I tried m5.large using CoreOS-stable-1632.3.0-hvm (ami-692faf11) with kernel 4.14.67. The test keeps attaching and detaching a volume to an existing instance repeatedly in a loop, but I couldn't reproduce the issue.
Is the issue happening when volumes are attached to a new instance at creation time, or when volumes are attached to an existing instance? And how do we know this is an EBS issue rather than a Container Linux kernel issue?
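For reference, such a test amounts to a loop along these lines (a rough sketch, assuming a configured AWS CLI; the volume ID, instance ID, and device name below are placeholders):
$ VOL=vol-0123456789abcdef0 INST=i-0123456789abcdef0
$ for i in $(seq 1 100); do
    aws ec2 attach-volume --volume-id "$VOL" --instance-id "$INST" --device /dev/xvdba
    aws ec2 wait volume-in-use --volume-ids "$VOL"
    sleep 15   # give the guest kernel time to probe the controller
    aws ec2 detach-volume --volume-id "$VOL"
    aws ec2 wait volume-available --volume-ids "$VOL"
  done
# meanwhile, on the instance itself, watch for probe failures:
$ dmesg -w | grep -i nvme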
I think the issue happens when a volume gets stuck in the detaching state on one instance; when it eventually detaches and is attached somewhere else, the device does not show up.
I haven't been able to reproduce this either, but I think the issue occurs regardless of kernel version; I have seen people report this on RHEL/3.10.x kernels.
I've often seen this when draining an m5/c5 Kubernetes node that is running replicas of a StatefulSet. Frequently the rescheduled replicas failed to mount their persistent volumes; reverting to c4/m4 directly resolved the issue.
I would assume the CoreOS issue might be related to https://github.com/coreos/bugs/issues/2484, which might explain why it only happens occasionally (the timeout is hit).
Looks like a fix is ready but waiting for review: https://github.com/coreos/coreos-overlay/pull/3366
In our case we were hit by this issue after rebooting cluster nodes one by one (with multiple attached NVMe disks); afterwards almost all nodes had NVMe errors in the kernel log. It looks to me like a node with many (and large) volumes that tries to attach them all at once (e.g. on reboot) hits the default timeout limits.
We've applied bigger timeouts (255 sec + 10 retries) and it seems this did the trick for us.
Some NVMe issues appear to be fixed in the 4.19 kernel, which is included in the current Container Linux alpha. Could you try the current alpha and see if the problem still exists?
For the sake of audience awareness, I have experienced the same issue on the t3.medium EC2 instance type with the 4.14.78 kernel.
An enlarged timeout hasn't helped; switching to the t2 family has.
I am able to reproduce the issue by following the steps here.
My setup is:
- kernel - 4.14.88-88.76.amzn2.x86_64
- Amazon Linux 2
- m5.2xlarge
- gp2 200 GiB
I also verified through dmesg that nvme_core.io_timeout is set to 4294967295.
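A couple of other ways to confirm the effective value, besides dmesg (standard procfs/sysfs paths, nothing specific to this setup):
$ cat /proc/cmdline                                  # shows any nvme_core.* overrides passed at boot
$ cat /sys/module/nvme_core/parameters/io_timeout    # effective value, however it was set
$ cat /sys/module/nvme_core/parameters/max_retries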