containerd icon indicating copy to clipboard operation
containerd copied to clipboard

Use UnmarshalTo instead of UnmarshalAny when dealing with cgroup metrics.

Open alam0rt opened this issue 2 years ago • 10 comments

After encountering the issue here: https://github.com/containerd/containerd/issues/9052, it appears that the type assertion was failing as the typeof data was an interface{} despite the concrete type being a pointer to v1.Metrics. We can instead unmarshal the data directly into the expected type variable which appears to work as expected.

alam0rt avatar Oct 05 '23 00:10 alam0rt

Hi @alam0rt. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 05 '23 00:10 k8s-ci-robot

The unit test below didn't result in any error to me (with v1.7.5). Do you mind running it on your end?

func TestMetrics(t *testing.T) {
	m := &v1.Metrics{
		CPU: &stats.CPUStat{
			Usage: &stats.CPUUsage{
				Total: 50,
			},
		},
	}
	stats, err := typeurl.MarshalAny(m)
	assert.NoError(t, err)

	data, err := typeurl.UnmarshalAny(stats)
	assert.NoError(t, err)
	_, ok := data.(*v1.Metrics)
	assert.True(t, ok)

	dst := &v1.Metrics{}
	err = typeurl.UnmarshalTo(stats, dst)
	assert.NoError(t, err)
}

henry118 avatar Oct 07 '23 00:10 henry118

The unit test below didn't result in any error to me (with v1.7.5). Do you mind running it on your end?

func TestMetrics(t *testing.T) {
	m := &v1.Metrics{
		CPU: &stats.CPUStat{
			Usage: &stats.CPUUsage{
				Total: 50,
			},
		},
	}
	stats, err := typeurl.MarshalAny(m)
	assert.NoError(t, err)

	data, err := typeurl.UnmarshalAny(stats)
	assert.NoError(t, err)
	_, ok := data.(*v1.Metrics)
	assert.True(t, ok)

	dst := &v1.Metrics{}
	err = typeurl.UnmarshalTo(stats, dst)
	assert.NoError(t, err)
}

Damn, won't compile on OSX due to build flags.

I tried using ctr metrics

Oct 10 00:08:24 containerd[1448]: {"error":null,"level":"error","msg":"invalid metric type for 535655fc1c59d67a1c3c0e6a0f81f1636b6758485b80e162c730be9f86de0360","time":"2023-10-10T00:08:24.769770849Z"}
ID                                                                  TIMESTAMP
535655fc1c59d67a1c3c0e6a0f81f1636b6758485b80e162c730be9f86de0360    seconds:1696896532  nanos:163248295

METRIC                   VALUE
memory.usage_in_bytes    30474240
memory.limit_in_bytes    268435456
memory.stat.cache        2060288
cpuacct.usage            2204998916
cpuacct.usage_percpu     [144178557 89870576 31787241 48788881 365999279 34713446 43435968 539148328 53590245 112969897 453806691 44179111 145900230 33398840 31158931 32072695]
pids.current             24
pids.limit               0

No issue there. Will continue looking

alam0rt avatar Oct 10 '23 00:10 alam0rt

Some more info: I believe this started occuring after the upgrade to 1.7.0 which I can see is when gogoproto was removed?

I added a print to https://github.com/containerd/containerd/blob/fe457eb99ac0e27b3ce638175ef8e68a7d2bc373/vendor/github.com/containerd/typeurl/v2/types.go https://github.com/zendesk/containerd/blob/fe457eb99ac0e27b3ce638175ef8e68a7d2bc373/vendor/github.com/containerd/typeurl/v2/types.go and can see that the typeURL Is gogoproto

		case gogoproto.Message:
			err = gogoproto.Unmarshal(value, t)
			fmt.Printf("(typeURL: %v), gogoproto: %s\n", typeURL, t.String())
		}

containerd[58435]: (typeURL: io.containerd.cgroups.v1.Metrics), gogoproto: &Metrics{H....

alam0rt avatar Oct 10 '23 00:10 alam0rt

@alam0rt What build of containerd are you using? I was wondering how the types could conflict since a panic will occur with multiple registrations. However, typeurl, gogoproto, and protobuf packages each have their own global registrations. I'm wondering why your build has this type registered while others don't seem to.

dmcgowan avatar Oct 10 '23 03:10 dmcgowan

@alam0rt What build of containerd are you using? I was wondering how the types could conflict since a panic will occur with multiple registrations. However, typeurl, gogoproto, and protobuf packages each have their own global registrations. I'm wondering why your build has this type registered while others don't seem to.

Hey @dmcgowan, we aren't using anything to esoteric I don't think

# containerd github.com/containerd/containerd v1.7.5 https://github.com/containerd/containerd/commit/fe457eb99ac0e27b3ce638175ef8e68a7d2bc373

$ runc --version
runc version 1.1.8
commit: v1.1.8-0-g82f18fe0
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4

alam0rt avatar Oct 10 '23 05:10 alam0rt

Any guidance on this would be great - not sure why we are seeing this error considering the code seems fine + we are running a proper release

alam0rt avatar Oct 19 '23 23:10 alam0rt

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 20 '24 03:01 k8s-ci-robot

Hi @alam0rt, @dmcgowan, @kiashok, @henry118 I would like to flag this patch as being important, I am currently seeing issues with stats when using Containerd 1.7.x with default runC (and gVisor), however when switching back to any of the 1.6.x releases this problem is resolved. I have also checked the stats returned from the various shims and they line up with the expected fields required by Containerd.

I also see that ctr currently uses the suggested method by @alam0rt to parse the metrics results from the shim API (see here), so it would be a good idea to keep the implementations similar anyway?

Champ-Goblem avatar Jan 31 '24 18:01 Champ-Goblem

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar May 01 '24 00:05 github-actions[bot]

This PR was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar May 09 '24 00:05 github-actions[bot]

I would suggest reopening this pr as the issue still exists and this pr fixes it.

Unmarshal io.containerd.cgroups.v1.Metrics with UnmarshalAny in 1.7+ (don't track which version it started with, but the latest version still has this issue) will first be parsed as github.com/containerd/cgroups/stats/v1.Metrics.

If UnmarshalAny in a unit test it shouldn't have this problem because it's not importing the wrong Metrics

This is due to hcsshim causing the dependency to have both github.com/containerd/cgroups/stats/v1/ and . /vendor/github.com/containerd/cgroups/v3/cgroup1/stats

How to reproduce this issue: env:

  • containerd release/1.7 or 1.7.19
  • cgroup v1

steps:

  1. configure /etc/containerd/config.toml to enable metrics
[metrics]
  address = "127.0.0.1:7000"
  1. curl 127.0.0.1:7000/v1/metrics | grep ^container_
  2. You will notice that there is no container_* indicator.
  3. Check the logs of containerd, error log -> {"error":null,"level":"error","msg":"invalid metric type for ...

Iceber avatar Jul 11 '24 07:07 Iceber

This pr should rebase though, and it's not the issue in 2.0 because there aren't two v1/stat.Metrics

Iceber avatar Jul 11 '24 07:07 Iceber