
Metrics collector fails to create watcher

Open · drawesomenic opened this issue on Jan 11, 2022

/kind bug

What steps did you take and what happened: I started Katib runs using Kale. About 50% of the pipelines succeed and 50% fail randomly, with the following error message from the "metrics-logger-and-collector" container:

Mon, Jan 10 2022 4:07:47 pm | I0110 15:07:47.414005 20 main.go:342] Trial Name: test-dev-blo6q-ptpgnwzg
Mon, Jan 10 2022 4:07:47 pm | 2022/01/10 15:07:47 FATAL -- failed to create Watcher
Mon, Jan 10 2022 4:07:47 pm | goroutine 34 [running]:
Mon, Jan 10 2022 4:07:47 pm | runtime/debug.Stack()
Mon, Jan 10 2022 4:07:47 pm | /usr/local/go/src/runtime/debug/stack.go:24 +0x65
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/util.Fatal({0xcc1a11, 0x0}, {0x0, 0x0, 0x0})
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc0000bc000)
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
Mon, Jan 10 2022 4:07:47 pm | created by github.com/hpcloud/tail/watch.glob..func1
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x173
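
One plausible explanation for the stack trace above: hpcloud/tail's util.Fatal kills the whole collector process whenever the kernel refuses to create an inotify watcher, and on Linux that typically happens once the per-user limit fs.inotify.max_user_instances is exhausted. A minimal Go sketch, assuming exhaustion is the trigger, that reproduces the underlying "too many open files" (EMFILE) condition:

package main

import (
	"fmt"
	"syscall"
)

// Allocate inotify instances until the kernel refuses. Once
// fs.inotify.max_user_instances is reached, InotifyInit returns
// EMFILE ("too many open files"), the same condition that makes
// watcher creation fail inside hpcloud/tail.
func main() {
	var fds []int
	for {
		fd, err := syscall.InotifyInit()
		if err != nil {
			fmt.Printf("inotify_init failed after %d instances: %v\n", len(fds), err)
			break
		}
		fds = append(fds, fd)
	}
	// Release the instances so the rest of the system recovers.
	for _, fd := range fds {
		syscall.Close(fd)
	}
}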

What did you expect to happen: The succeeding pipelines throw no error and instead show normal output:

Wed, Jan 5 2022 8:26:38 pm | I0105 19:26:37.970244 16 main.go:342] Trial Name: test-dev-gtbb0-847s8svl
Wed, Jan 5 2022 8:26:39 pm | I0105 19:26:39.075769 16 main.go:136] 2022-01-05 19:26:39 Kale kfputils:176 [INFO] Creating KFP experiment 'test-dev-gtbb0'...

Anything else you would like to add: I also tried increasing the resources via katib-config, but that did not resolve the issue. The error is not tied to specific pipeline parameters; it happens randomly. The workflow itself completes successfully, but because the "metrics-logger-and-collector" container fails, the related job and trial fail as well.

Environment:

  • Katib version (check the Katib controller image version): 0.12.0
  • Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
  • OS (uname -a): Linux dashboard-shell-w5nrd 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

drawesomenic avatar Jan 11 '22 08:01 drawesomenic

/kind question
/priority p2
/area katib

jbottum avatar Jan 11 '22 16:01 jbottum

Thank you for creating this @drawesomenic. Did you try to use the File metrics collector instead of StdOut? Also, can you show me your Entrypoint command for the Trial training job container?

andreyvelich avatar Jan 13 '22 15:01 andreyvelich

It might be this issue: https://github.com/hpcloud/tail/issues/151#issuecomment-747274673. Did you build your own Metrics Collector image on aarch64?
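
A quick way to test this outside Katib, assuming the watcher in the stack trace is fsnotify-based as the linked issue suggests: run a minimal probe in the same image and architecture and see whether watcher creation itself fails.

package main

import (
	"fmt"

	"github.com/fsnotify/fsnotify"
)

// Try to create a single filesystem watcher. If this fails with
// EPERM, EMFILE, or ENOSYS, the problem lies with the kernel,
// seccomp profile, or resource limits of the environment rather
// than with Katib itself.
func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		fmt.Println("failed to create watcher:", err)
		return
	}
	defer watcher.Close()
	fmt.Println("watcher created successfully")
}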

andreyvelich avatar Jan 13 '22 15:01 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 06:04 stale[bot]

I'm also getting this error periodically with the default metrics collector image on x86:

I0513 15:40:02.000547      66 main.go:394] Trial Name: orbit-dlt-g6gm7-6qwczzt9
2022/05/13 15:40:02 FATAL -- failed to create Watcher
goroutine 18 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/hpcloud/tail/util.Fatal({0xd083fb?, 0x0?}, {0x0, 0x0, 0x0})
	/go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc000132040)
	/go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
created by github.com/hpcloud/tail/watch.glob..func1
	/go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x16e

This was with the StdOut collector (spec below), though it looks like I can also replicate it with the File metrics collector:

metricsCollectorSpec:
  collector:
    kind: StdOut

If it matters, this is running on MicroK8s on my laptop.
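
If inotify limits are the culprit, MicroK8s would be a plausible place to hit them: pods share the host's sysctls, and fs.inotify.max_user_instances often defaults to as little as 128, so enough concurrent log-tailing sidecars can exhaust it. A small sketch to inspect the node's limits; raising them (e.g. sudo sysctl fs.inotify.max_user_instances=1024) is a common mitigation, assuming exhaustion is the cause:

package main

import (
	"fmt"
	"os"
	"strings"
)

// Print the kernel's per-user inotify limits from /proc.
func main() {
	for _, path := range []string{
		"/proc/sys/fs/inotify/max_user_instances",
		"/proc/sys/fs/inotify/max_user_watches",
	} {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", path, err)
			continue
		}
		fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(data)))
	}
}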

knkski avatar May 13 '22 16:05 knkski

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 12 '23 15:09 github-actions[bot]

Sorry for the late reply. Are you still experiencing this issue?

andreyvelich avatar Sep 12 '23 19:09 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 12 '23 00:12 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Jan 01 '24 00:01 github-actions[bot]

@andreyvelich could this be re-opened? I also hit this with docker.io/kubeflowkatib/file-metrics-collector:v0.16.0 at random intervals.

AndersBennedsgaard avatar Apr 04 '24 12:04 AndersBennedsgaard