
Get podname and namespace "unknown"

Open wsszh opened this issue 3 years ago • 16 comments

Hi, I set filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{podname}-{namespace}", but the filename I get looks like this: "9a1fc79c-758c-4599-a22d-2e94444a3250-dump-1657867608-segfaulter-segfaulter-1-4-unknown-unknown.zip", i.e. {podname} and {namespace} resolve to "unknown". How can I fix this?
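
For reference, this is roughly how I installed it (the repo URL, release name and namespace below are from memory and may differ in your setup; the composer.filenameTemplate key is the one in the chart's values.yaml):

# Assumed chart repo; adjust if your install differs
helm repo add core-dump-handler https://ibm.github.io/core-dump-handler/
# Minimal values override (storage/S3 settings omitted for brevity)
cat > my-values.yaml <<'EOF'
composer:
  filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{podname}-{namespace}"
EOF
helm install core-dump-handler core-dump-handler/core-dump-handler -n observe --create-namespace -f my-values.yaml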

wsszh avatar Jul 15 '22 06:07 wsszh

Hey, I'm also getting this "unknown". How did you solve it?

joaogbcravo avatar May 22 '23 11:05 joaogbcravo

We see this behaviour here as well (on AWS). Any news? I see the issue is set to closed, however it doesn't seem to be resolved?

Robert-Stam avatar Jan 16 '24 15:01 Robert-Stam

Hey @Robert-Stam Can you confirm which aws.values.xxx.yaml you have used in the deployment and which version of EKS you are using. It's likely that the version of crio is now outdated as this hasn't been updated for a while.

No9 avatar Jan 17 '24 00:01 No9

I have used the settings from: https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.aws.yaml

We are using Kubernetes 1.28 (on Intel hardware, m6i family) with the AMI amazon-eks-node-1.28-v20240110. See: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240110

(image attached)

Thanks in advance!

Robert-Stam avatar Jan 17 '24 08:01 Robert-Stam

@No9 Hi Anton, any update on this?

Robert-Stam avatar Mar 15 '24 14:03 Robert-Stam

I don't have access to an AWS account to debug. Can you log into an agent container that has processed a core dump and provide the output of

cat /var/mnt/core-dump-handler/composer.log

If there are no errors, can you enable debugging by setting https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L27 to Debug
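
Roughly like this; <agent-pod> and <namespace> are placeholders for your deployment, and the release/chart names are whatever you installed with:

# Read the composer log from a running agent pod
kubectl exec -n <namespace> <agent-pod> -- cat /var/mnt/core-dump-handler/composer.log

# Turn on debug logging via the chart value (re-deploys the daemonset)
helm upgrade core-dump-handler core-dump-handler/core-dump-handler -n <namespace> --reuse-values --set composer.logLevel=Debug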

No9 avatar Mar 22 '24 22:03 No9

I tested with k8s v1.29 on AKS (Azure) and GKE (Google), and in both cases the namespace resolves to 'unknown'. This is the output from the composer log on AKS:

ERROR - 2024-04-05T09:41:43.149688332+00:00 - failed to create pod at index 0
ERROR - 2024-04-05T09:41:47.803435709+00:00 - Failed to get pod id

Hope this helps.

Robert-Stam avatar Apr 05 '24 09:04 Robert-Stam

@No9 I tried to create a small PR to update the packages and crictl version, however without luck. FYI, here is my PR: https://github.com/IBM/core-dump-handler/pull/158

Have you tried k8s v1.29 in the IBM Cloud successfully?

Robert-Stam avatar Apr 05 '24 11:04 Robert-Stam

crictl is already on the host on IKS and others so it isn't a useful test. Did you look for the composer logs as per this comment? https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2016027968

No9 avatar Apr 08 '24 18:04 No9

See: https://github.com/IBM/core-dump-handler/issues/102#issuecomment-2039364826

Robert-Stam avatar Apr 08 '24 19:04 Robert-Stam

Have you tried k8s v1.29 in the IBM Cloud successfully?

Robert-Stam avatar Apr 08 '24 19:04 Robert-Stam

Sorry I missed your log output post for some reason. So it appears as though this command is executing but not returning a list of pods:

crictl pods  --name <hostname> -o json

where <hostname> is captured from the crashing container.

Are you overriding the hostname on the deployed workloads?

In the meantime I'll take a look at a 1.29 cluster to confirm. [Edit] Confirmed that the core dump works as expected on IBM Cloud IKS 1.29 with no additional values parameters. Tested with the following failing container.

kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never
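
As an aside, a quick way to check whether a workload sets an explicit hostname (pod and namespace are placeholders; empty output means no override):

# Prints the pod's spec.hostname, which is empty unless the workload overrides it
kubectl get pod <crashing-pod> -n <namespace> -o jsonpath='{.spec.hostname}'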

No9 avatar Apr 08 '24 21:04 No9

I am not overriding the hostname.

To make sure we are on the same page: you did test with {namespace} in the filenameTemplate, and it was filled in correctly?

Robert-Stam avatar Apr 09 '24 07:04 Robert-Stam

Revalidated with this config:

composer:
  ignoreCrio: false
  crioImageCmd: "img"
  logLevel: "Warn"
  filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{namespace}"

Ran kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never

The following output is obtained from the container, showing the default namespace in the generated filename.

[2024-04-09T20:04:27Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/3fb6b86a-6726-4f5c-80fd-f34e8a971536-dump-1712693067-segfaulter-segfaulter-1-4-default.zip
[2024-04-09T20:04:27Z INFO  core_dump_agent] zip size is 28610
[2024-04-09T20:04:27Z INFO  core_dump_agent] S3 Returned: 200

Can I suggest getting a debug container on the host and establishing what happens when the following is run? If JSON is returned, can you post it here and/or validate it in the test suite.

crictl pods  --name <hostname> -o json

Thanks. [Edit] Kubernetes info: IBM Kubernetes Service 1.29.3_1531

No9 avatar Apr 09 '24 20:04 No9

@No9 Anton, I executed your command in the running container (ibm/core-dump-handler:v8.10.0) on AWS (with k8s v1.29). This is the result:

[root@core-dump-lgd5p app]# ./crictl pods  --name ip-10-87-16-57.eu-west-2.compute.internal -o json
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0002] connect endpoint 'unix:///var/run/dockershim.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
ERRO[0004] connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
FATA[0006] connect: connect endpoint 'unix:///run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded

And these are the settings applied (based on the log):

[2024-04-10T09:21:31Z INFO  core_dump_agent] Writing composer .env
    LOG_LEVEL=Warn
    IGNORE_CRIO=false
    CRIO_IMAGE_CMD=img
    USE_CRIO_CONF=false
    FILENAME_TEMPLATE={namespace}-{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}
    LOG_LENGTH=500
    POD_SELECTOR_LABEL=
    TIMEOUT=600
    COMPRESSION=true
    CORE_EVENTS=false
    EVENT_DIRECTORY=/var/mnt/core-dump-handler/events

Robert-Stam avatar Apr 10 '24 09:04 Robert-Stam

OK, it looks like you are trying to run crictl from the handler container. What I was trying to suggest was setting up a debug session on the node, e.g.

kubectl get nodes 
NAME             STATUS   ROLES           AGE    VERSION
node1   Ready    master,worker   176d   v1.26.9+52589e6
node2   Ready    master,worker   176d   v1.26.9+52589e6
node3    Ready    master,worker   176d   v1.26.9+52589e6

With a node name (it doesn't matter which), run

 kubectl debug node/node1 --image=ubuntu

When you have a debug session, run something like the following:

/host/usr/bin/crictl -r unix:///host/run/crio/crio.sock pods  --name core-dump-lgd5p -o json

Where /host/usr/bin/crictl is wherever you have configured crictl to be copied, unix:///host/run/crio/crio.sock is the crio socket (which may be in a different location), and core-dump-lgd5p is the pod name.
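
If you are not sure where crictl or the runtime socket lives on the node, something along these lines from the debug session can help locate them (the paths below are just common defaults, not guaranteed):

# Look for the crictl binary copied to the host (location depends on the agent configuration)
find /host -maxdepth 4 -name crictl -type f 2>/dev/null
# Look for a runtime socket (crio or containerd, depending on the distro)
ls -l /host/run/crio/crio.sock /host/run/containerd/containerd.sock 2>/dev/null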

Expected output:

{
  "items": [
    {
      "id": "df2bb27cbc78c2fb51aea8cb2f9eeb6124c871244a5fb71e989458bb673125df",
      "metadata": {
        "name": "core-dump-handler-7kqc6",
        "uid": "c8ea5ce9-72be-4826-82b3-b8c3a8144d50",
        "namespace": "observe",
        "attempt": 0
      },
      "state": "SANDBOX_READY",
      "createdAt": "1712691523593249607",
      "labels": {
        "controller-revision-hash": "7b6c988b5d",
        "io.kubernetes.container.name": "POD",
        "io.kubernetes.pod.name": "core-dump-handler-7kqc6",
        "io.kubernetes.pod.namespace": "observe",
        "io.kubernetes.pod.uid": "c8ea5ce9-72be-4826-82b3-b8c3a8144d50",
        "name": "core-dump-ds",
        "pod-template-generation": "1"
      },
      "annotations": {
        "kubectl.kubernetes.io/default-container": "coredump-container",
        "kubernetes.io/config.seen": "2024-04-09T14:38:43.120492372-05:00",
        "kubernetes.io/config.source": "api",
        "openshift.io/scc": "core-dump-admin-privileged"
      },
      "runtimeHandler": ""
    }
  ]
}
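
If JSON like the above comes back, something along these lines (assuming jq is installed in the debug container, e.g. apt-get install -y jq on the ubuntu image) pulls out the fields the composer needs:

# Print the pod name and namespace of the first matching sandbox
/host/usr/bin/crictl -r unix:///host/run/crio/crio.sock pods --name core-dump-lgd5p -o json | jq -r '.items[0].metadata | "\(.name) \(.namespace)"'

For the sample output above that prints "core-dump-handler-7kqc6 observe".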

No9 avatar Apr 10 '24 12:04 No9