
[flaky] TestMetricsExport is super flaky

Open vagababov opened this issue 4 years ago • 17 comments

metrics/resource_view_test.go - TestMetricsExport flakes most of the time.

/assign @jjzeng-seattle @evankanderson

here's a good example: https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/1666/pull-knative-pkg-unit-tests/1300475037081407489

vagababov · Aug 31 '20 21:08

cc @yanweiguo (community oncall)

mattmoor · Aug 31 '20 21:08

I ran the tests hundreds of times with my own cluster and could only reproduce the timeout error once. I guess I have to send out a PR to run the CI/CD to debug.

yanweiguo · Sep 02 '20 20:09

/reopen

I still see issues, though with a different signature. Collecting signatures here.

=== CONT  TestMetricsExport
    resource_view_test.go:334: Created exporter at localhost:12345
    logger.go:130: 2020-09-27T02:15:16.884Z	INFO	metrics/exporter.go:155	Flushing the existing exporter before setting up the new exporter.
    logger.go:130: 2020-09-27T02:15:16.940Z	ERROR	websocket/connection.go:138	Websocket connection could not be established	{"error": "dial tcp: lookup somewhere.not.exist on 10.7.240.10:53: no such host"}
    logger.go:130: 2020-09-27T02:15:16.975Z	INFO	metrics/opencensus_exporter.go:56	Created OpenCensus exporter with config:	{"config": {}}
    logger.go:130: 2020-09-27T02:15:16.975Z	INFO	metrics/exporter.go:168	Successfully updated the metrics exporter; old config: &{knative.dev/serving testComponent prometheus 1000000000 <nil> <nil>  false 19090 false   {   false}}; new config &{knative.dev/serving testComponent opencensus 1000000000 <nil> <nil> localhost:12345 false 0 false   {   false}}

and then:

    resource_view_test.go:370: Timeout reading input
    resource_view_test.go:376: Unexpected OpenCensus exports (-want +got):
          []metrics.metricExtract(Inverse(Sort, []string{
          	"knative.dev/serving/testComponent/global_export_counts<>:2",
          	"knative.dev/serving/testComponent/resource_global_export_count<>:2",
          	`knative.dev/serving/testComponent/testing/value<project="p1",rev`...,
        - 	`knative.dev/serving/testComponent/testing/value<project="p1",revision="r2">:1`,
          }))

evankanderson · Sep 30 '20 21:09

@evankanderson: Reopened this issue.

In response to this:

/reopen


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot · Sep 30 '20 21:09

Seeing interleaved logging from different tests, I'm slightly suspicious that we're hitting side effects of the global monitoring singleton.

Unfortunately, it seems hard to adjust our current Prow test infrastructure to run these separately; let me look into doing it via GitHub Actions.
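
For illustration, here is a minimal, self-contained sketch of the failure mode I have in mind. It is not the actual knative.dev/pkg/metrics code; the `exporter` variable and the helpers around it are hypothetical stand-ins. Two parallel tests that both swap a package-level exporter can each observe the other's configuration, which would also explain the interleaved log lines above.

```go
package metrics_test

import (
	"sync"
	"testing"
)

// Hypothetical stand-in for the package-level exporter singleton; the real
// knative.dev/pkg/metrics code guards a similar global.
var (
	mu       sync.RWMutex
	exporter string
)

func setExporter(e string) {
	mu.Lock()
	defer mu.Unlock()
	exporter = e
}

func currentExporter() string {
	mu.RLock()
	defer mu.RUnlock()
	return exporter
}

// Both tests install "their" exporter and then assert against it. Because the
// singleton is shared, whichever setExporter call runs last wins, so either
// test can end up exporting through the other's configuration.
func TestOpenCensusExport(t *testing.T) {
	t.Parallel()
	setExporter("opencensus://localhost:12345")
	if got := currentExporter(); got != "opencensus://localhost:12345" {
		t.Errorf("exporter = %q; clobbered by a parallel test?", got)
	}
}

func TestPrometheusExport(t *testing.T) {
	t.Parallel()
	setExporter("prometheus://:19090")
	if got := currentExporter(); got != "prometheus://:19090" {
		t.Errorf("exporter = %q; clobbered by a parallel test?", got)
	}
}
```

If that is what's happening, serializing the affected tests or giving each test its own exporter (instead of mutating the global) should remove the flake.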

evankanderson · Sep 30 '20 21:09

Update: I've managed to reproduce this in roughly 1 of every 50 runs when running all the tests under the e2e script.

It looks like somehow the default exporter is sometimes trying to export to the default localhost:55678 address rather than the address in the config. I'm still trying to figure out why this happens.

I've also found a small bug in the Prometheus exporter where it won't necessarily re-create the exporter when the port changes. Since that seems to happen rarely in the current scenarios, I'm going to roll that fix in with the other change.
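
To make the port-change bug concrete, here is a rough sketch of the kind of comparison I mean; the type and function names are made up for illustration and don't match the real knative.dev/pkg/metrics code. If the "do we need a new exporter?" check only compares backends, a Prometheus port change keeps serving on the old port.

```go
package main

import "fmt"

// Hypothetical sketch: the type and function names below do not match the
// real knative.dev/pkg/metrics code.
type metricsConfig struct {
	backend        string // "prometheus", "opencensus", ...
	prometheusPort int
}

// needsNewExporter reports whether the exporter must be rebuilt for next.
// A check that only compares backends misses a Prometheus port change;
// including the port in the comparison closes that gap.
func needsNewExporter(cur, next metricsConfig) bool {
	if cur.backend != next.backend {
		return true
	}
	if next.backend == "prometheus" && cur.prometheusPort != next.prometheusPort {
		return true // same backend, new port: the HTTP listener must be restarted
	}
	return false
}

func main() {
	cur := metricsConfig{backend: "prometheus", prometheusPort: 19090}
	next := metricsConfig{backend: "prometheus", prometheusPort: 9090}
	fmt.Println(needsNewExporter(cur, next)) // true: port changed, rebuild
}
```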

evankanderson · Oct 19 '20 16:10

I've fixed a few bugs, but I'm still seeing one case where the default meter seems to have lost track of all of the metrics associated with it.
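
For reference, a standalone OpenCensus sketch (plain go.opencensus.io, not the knative/pkg meter code) of one way previously recorded data can vanish: if a view is unregistered and re-registered, for example while the exporter is being swapped, everything recorded before the swap is gone.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

func main() {
	ctx := context.Background()
	m := stats.Int64("testing/value", "a test measure", stats.UnitDimensionless)
	v := &view.View{Name: "testing/value", Measure: m, Aggregation: view.Count()}

	if err := view.Register(v); err != nil {
		log.Fatal(err)
	}
	stats.Record(ctx, m.M(1))

	// Unregistering and re-registering the view (which can happen while an
	// exporter is being swapped) discards everything recorded so far.
	view.Unregister(v)
	if err := view.Register(v); err != nil {
		log.Fatal(err)
	}

	rows, err := view.RetrieveData("testing/value")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(rows)) // 0: the earlier Record is no longer visible
}
```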

evankanderson · Oct 21 '20 19:10

@evankanderson any updates on this?

vagababov · Dec 04 '20 21:12

@evankanderson I still see this issue here: https://github.com/knative/pkg/pull/2005#issuecomment-776174080

skonto · Feb 09 '21 19:02

/reopen

TestMetricsExport/OpenCensus

https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2189/pull-knative-pkg-unit-tests/1415384970213462016

dprotaso · Jul 14 '21 20:07

@dprotaso: Reopened this issue.

In response to this:

/reopen


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot · Jul 14 '21 20:07

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · Oct 13 '21 01:10

/remove-lifecycle stale

still messed up https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_pkg/2328/pull-knative-pkg-unit-tests/1453084081465069568

benmoss · Oct 26 '21 19:10

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · Jan 25 '22 01:01

/remove-lifecycle stale

pierDipi · Jan 25 '22 07:01

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · Apr 26 '22 01:04

/lifecycle frozen

dprotaso · Apr 26 '22 21:04