docker-node
Segfault in node:10-alpine(3.10) when running in AWS and unprivileged only
Slightly confused by what's happening here, but we have a Docker image that was built with node:10-alpine, and when it was rebuilt after that tag moved from Alpine 3.9 to 3.10, the container entered a crash loop with a segfault during application startup.
Rolling back to the previous image, or rolling forward with everything pinned to node:10-alpine3.9, seems to make this go away. More weirdly, I can't reproduce this on non-AWS instances but can reliably reproduce it on multiple AWS instances. I also noticed that when the container is run with --privileged it works fine.
Looking at the core dump from the non-debug build, it looks to be an issue in musl, but without debug symbols I don't yet know what's triggering it:
#0 0x00007fe1375ee07e in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x00007fe1375eb4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x00007fe134c22b64 in ?? ()
#3 0x0000000000000000 in ?? ()
I'm also very confused why it wouldn't segfault like this when it's run outside of AWS or when the container is run as privileged.
Any ideas on how I can debug this further?
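For reference, a rough sketch of symbolizing the dump inside the same image (the gdb and musl-dbg package names, the node binary path, and the core file name are assumptions, not anything confirmed in this thread):

```sh
# Assumed workflow: collect the core on the host, then open it with gdb inside
# the same Alpine image so musl's debug symbols match the crashing libc.
docker run --rm -it -v "$PWD/cores:/cores" node:10-alpine sh

# Inside the container:
apk add --no-cache gdb musl-dbg      # gdb plus debug symbols for musl (assumed package names)
gdb /usr/local/bin/node /cores/core  # /usr/local/bin/node is where the official image installs node
# (gdb) bt                           # the ld-musl frames should now resolve to symbols
```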
I see a similar issue. Switching to node:10-alpine3.9 seemed to work.
Same issue, same mitigation as OP, running on AWS Kops 1.14 instances. Our other clusters with lower Kops versions are not affected, so it seems to be something tied to the OS.
@schlippi curious about the kops thing there. kops 1.14 with k8s 1.13 also fails?
Me too... I'm using node:10.17.0-alpine, which had been based on Alpine 3.9 for weeks. But it seems this commit brings a breaking change: https://github.com/nodejs/docker-node/commit/c6bc44e84afcdb81d9749b7b034c60e916a519ad#diff-b24491fb48497b165ae0f777c31da853
Since then, the tag 10.17.0-alpine has become an alias of 10.17.0-alpine3.10, and the old one was renamed to 10.17.0-alpine3.9.
I guess this Segfault in node:10-alpine issue happens a lot on AWS this week. Two options to fix it:
- Roll back to 10.17.0-alpine3.9 (see the sketch below), or
- Fix the musl-libc/glibc thread stack size difference by following https://github.com/nodejs/docker-node/issues/813#issuecomment-407339011

An example of option 2: https://github.com/jubel-han/dockerfiles/blob/master/node/Dockerfile
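For anyone who just needs the quick fix, option 1 is only a base-image change (the image name below is a placeholder for your application):

```sh
# Option 1 in practice: pin the base image in your Dockerfile back to the
# Alpine 3.9 variant, e.g.
#   FROM node:10.17.0-alpine3.9
# then rebuild and redeploy (my-app is a placeholder name):
docker build -t my-app:node10-alpine3.9 .
```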
I would add that it's not happening only in AWS. The same issue started to appear in my on-premise environment, especially with services using the node-rdkafka library. Changing the services to use node:10-alpine3.9 fixed the issue.
> @schlippi curious about the kops thing there. kops 1.14 with k8s 1.13 also fails?
@olemarkus Apparently the issue is related to the OS / node instances. We didn't have the problem with Kops 1.10 running on m4 instances, but got it after upgrading to Kops 1.11 with m5 instances, with the same Docker images. Oh, forgot to mention that it also affects Node 8.
re https://github.com/nodejs/docker-node/issues/1158#issuecomment-557410925:
The stack size issue crash from https://github.com/nodejs/docker-node/issues/813#issuecomment-407322127 is fixed in Node in https://github.com/nodejs/node/commit/5539d7e360a625e729a4f95e67d232d3400fc137 (I ran a bisect), which is included in v13, but not v10. A related libuv fix (https://github.com/libuv/libuv/commit/d89bd156cfc9f66003a430c31149c4b94e18b904) is also included in Node v13.
That crash reproduces on both node:10.17-alpine3.9 and node:10.17-alpine3.10 (i.e., node:10-alpine). So, while I agree that this seems to be a stack size issue, it doesn't seem to be quite identical, since it existed on the old version as well.
We have encountered the same issue when using the Skylake CPU platform on GCP while trying to run the grpc package.
Had the exact same issue, but not related to AWS. We use Docker mainly for offline installations and it started to randomly happen on just a few machines. Build & deploy via GitLab CI (using the latest alpine) was always fine, so it was a bit hard to track the issue down. The health check error via docker inspect at least gave a small hint... but it wasn't that meaningful either:
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown
cannot exec in a stopped state: unknown
As mentioned already, reverting to 3.9 works for now.
EDIT: Not sure if it helps... I've seen very similar errors (with just slightly different error numbers) related to the kernel and runc in the past, but mainly on CentOS. However, I assume it's hardware or CPU related, as it only happens on every second machine... I compared different setups with clean Win 10 Pro + Docker installations.
Also, I've noticed strange CPU peaks (exceeding 100%) when running docker stats during this sort of restart loop.
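In case it helps someone else chasing this, the two signals mentioned above can be pulled with standard commands (the container name is a placeholder):

```sh
# my-container is a placeholder for the crash-looping container.
docker inspect --format '{{json .State.Health}}' my-container   # health check output that surfaced the OCI runtime error above
docker stats --no-stream my-container                           # one-shot CPU/memory snapshot during the restart loop
```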
fwiw I ran into this recently with node:10-alpine 3.10 and 3.11. 3.9 does not have the issue.
I can get node:10-alpine 3.10 and 3.11 to work if I copy Docker's default seccomp profile, add membarrier to the whitelisted syscalls, and use that.
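A minimal sketch of that workaround, assuming Docker's default profile from the moby repo and that the main SCMP_ACT_ALLOW list is the first entry in the syscalls array (check the JSON for your Docker version before relying on the jq filter):

```sh
# Fetch Docker's default seccomp profile (URL/branch assumed).
curl -fsSL -o default.json \
  https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json

# Add membarrier to the allowed syscalls (assumes the allow-list is .syscalls[0]).
jq '.syscalls[0].names += ["membarrier"]' default.json > seccomp-membarrier.json

# Run the container with the modified profile instead of the built-in default.
docker run --rm --security-opt seccomp=seccomp-membarrier.json node:10-alpine node --version
```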
@tomelliff could you share the specs of the AWS VM, Docker version, etc.? I have failed to reproduce this. Thank you!