Error generating reports
Hello, I am using this tool and it is very good, congratulations. However, I have noticed that when a segmentation fault occurs, it sometimes generates all the files with another pod's namespace name.
I am attaching evidence below.
- cluster: EKS v1.21
- core-dump-handler version (helm list output):

```
NAME               NAMESPACE  REVISION  UPDATED                                  STATUS    CHART                     APP VERSION
core-dump-handler  observe    1         2022-07-01 04:33:50.377219926 +0000 UTC  deployed  core-dump-handler-v8.6.0  v8.6.0
```
- pod-info file: it contains the namespace env-1f1de3e2bda8, when in fact the pod is in the namespace env-e4e2facbcb22
I am considering updating core-dump-handler to the newest version, but I don't know whether that will solve the problem.
Do you have any idea how I could debug this error?
Regards, Gustavo.
Hi @chzgustavo, thanks for the feedback, I really appreciate it.
Do you have pods with the same name running in different namespaces?
Background
The container information is currently queried from CRI-O using the hostname of the crashing container, which is assumed to be unique.
This container hostname is then used to match the pod: https://github.com/IBM/core-dump-handler/blob/main/core-dump-composer/src/main.rs#L75
It isn't ideal, but using the hostname is the only way I am aware of to capture the crashing container's information.
This isn't an issue in most deployment scenarios, as people tend to use replicasets/deployments, which generate a unique id for each pod.
However, if you are creating pods with the same name directly in each namespace, then you have the potential to hit a name clash issue.
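To make the clash concrete, here is a rough sketch (not the composer's actual code) of what a hostname-only lookup against the runtime can return; the pod name web-0 is hypothetical and the namespaces are the ones from the report above:

```bash
# Hypothetical illustration: StatefulSet pods keep their pod name as hostname,
# so a name-only query for "web-0" matches pods in both namespaces and the
# handler cannot tell which one actually crashed.
crictl pods --name web-0 -o json | jq '.items[].metadata | {name, namespace}'
# { "name": "web-0", "namespace": "env-1f1de3e2bda8" }
# { "name": "web-0", "namespace": "env-e4e2facbcb22" }
```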
Possible Solution
If that sounds like the problem, I would suggest giving each pod a unique name when provisioning.
Yes, indeed, I have many pods with the same name running in different namespaces. The pods that generate segmentation faults belong to StatefulSet resources.
They all have the same hostname (but they are in different namespaces). Is there any other possible solution for this case? Thanks for your help!
Sorry, I'm not aware of another possible solution.
StatefulSets intentionally name their pods with ordinal numbers: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-identity
If you're using Helm you can add the namespace to the StatefulSet name, which would resolve this. I know it's clunky, but it should work well enough.
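As a rough sketch of that workaround (assuming the chart exposes a name override such as fullnameOverride, which is a common convention but not guaranteed; the release and chart names here are hypothetical):

```bash
# Hypothetical: bake the namespace into the StatefulSet name so the resulting
# pod hostnames (e.g. "env-e4e2facbcb22-web-0") are unique across namespaces.
helm upgrade --install web ./my-chart \
  --namespace env-e4e2facbcb22 \
  --set fullnameOverride=env-e4e2facbcb22-web
```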
The underlying issue here is that kernel.core_pattern is per host and not per container, so it's not possible to feed dynamic pod information to the kernel at runtime.
As systemd becomes more pod-aware there may be a possibility to do something there, but the last time I looked it just seemed to pass through to the system code.
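For context, this is roughly what the per-host setting looks like on a node; the pattern shown is only illustrative and the handler's actual template may differ:

```bash
# kernel.core_pattern is a single node-wide sysctl shared by every container on
# the node, so it cannot carry per-pod data such as the namespace. The kernel's
# %-specifiers (e.g. %h hostname, %e executable, %p pid, %t timestamp) are the
# only dynamic values a piped handler receives at crash time.
sysctl kernel.core_pattern
# illustrative output only; the actual template installed by the chart may differ:
# kernel.core_pattern = |/path/to/core-dump-composer %h %e %p %t
```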
[Edit] I will add this to the FAQ as it seems like it would be a fairly common scenario that will trip others up.
[Edit2] I'll double-check the statuses in the responses from CRI-O; it may be possible to detect whether the pod is crashing and, if it isn't, move on to the next pod. I seem to remember looking at this when I wrote it and it wasn't possible, but I'll double-check. I won't get to that for a bit though, as I have to look at #114 first.
Hi @No9,
Thanks for the information. We are still seeing this bug in production, since we didn't apply the "clunky workaround".
Have you made any progress on fixing it?
This project is very useful to us, thanks for the good work.