Metrics collector fails to create watcher
/kind bug
What steps did you take and what happened: I started Katib runs using Kale. About 50% of the pipelines succeed and the other 50% fail at random with the following error message from the "metrics-logger-and-collector" container:
Mon, Jan 10 2022 4:07:47 pm | I0110 15:07:47.414005 20 main.go:342] Trial Name: test-dev-blo6q-ptpgnwzg
Mon, Jan 10 2022 4:07:47 pm | 2022/01/10 15:07:47 FATAL -- failed to create Watcher
Mon, Jan 10 2022 4:07:47 pm | goroutine 34 [running]:
Mon, Jan 10 2022 4:07:47 pm | runtime/debug.Stack()
Mon, Jan 10 2022 4:07:47 pm | /usr/local/go/src/runtime/debug/stack.go:24 +0x65
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/util.Fatal({0xcc1a11, 0x0}, {0x0, 0x0, 0x0})
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc0000bc000)
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
Mon, Jan 10 2022 4:07:47 pm | created by github.com/hpcloud/tail/watch.glob..func1
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x173
What did you expect to happen: In the succeeding pipelines no error is thrown; instead, the container shows the normal output:
Wed, Jan 5 2022 8:26:38 pm | I0105 19:26:37.970244 16 main.go:342] Trial Name: test-dev-gtbb0-847s8svl
Wed, Jan 5 2022 8:26:39 pm | I0105 19:26:39.075769 16 main.go:136] 2022-01-05 19:26:39 Kale kfputils:176 [INFO] Creating KFP experiment 'test-dev-gtbb0'...
Anything else you would like to add: I also tried increasing the resources via katib-config, but it did not resolve the issue. The error is not tied to specific pipeline parameters; it happens randomly. The workflow itself completes successfully; however, because the "metrics-logger-and-collector" container fails, the related job and trial fail as well.
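For reference, a sketch of that katib-config change, assuming the standard metrics-collector-sidecar layout for this Katib version (the image tag and resource values are illustrative, not a confirmed fix):

apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "StdOut": {
        "image": "docker.io/kubeflowkatib/file-metrics-collector:v0.12.0",
        "resources": {
          "requests": { "cpu": "100m", "memory": "100Mi" },
          "limits": { "cpu": "500m", "memory": "1Gi" }
        }
      }
    }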
Environment:
- Katib version (check the Katib controller image version): 0.12.0
- Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
- OS (uname -a): Linux dashboard-shell-w5nrd 5.4.0-88-generic 99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 Linux
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
/kind question
/priority p2
/area katib
Thank you for creating this @drawesomenic. Did you try to use the File metrics collector instead of StdOut? Also, can you show me your Entrypoint command for the Trial training job container?
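For anyone trying that suggestion, a minimal sketch of a File metrics collector spec, assuming the training code writes its metrics to /var/log/katib/metrics.log (the path is illustrative and must match where your training code writes):

metricsCollectorSpec:
  collector:
    kind: File
  source:
    fileSystemPath:
      path: /var/log/katib/metrics.log
      kind: File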
It might be this issue: https://github.com/hpcloud/tail/issues/151#issuecomment-747274673. Did you build your own Metrics Collector image on aarch64?
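The linked report points at inotify exhaustion: hpcloud/tail creates an inotify watcher in the collector process, and if the node has run out of inotify instances (fs.inotify.max_user_instances, visible in /proc/sys/fs/inotify/max_user_instances), watcher creation fails, which would explain why only some trials crash. A minimal Go sketch of the failing call path, assuming hpcloud/tail v1.0.0 and the standard /var/log/katib/metrics.log path:

package main

import (
    "fmt"

    "github.com/hpcloud/tail"
)

func main() {
    // With Follow: true, hpcloud/tail creates a shared inotify watcher.
    // If inotify creation fails on the node (e.g. fs.inotify.max_user_instances
    // is exhausted), the library calls util.Fatal and aborts the whole
    // process -- the "FATAL -- failed to create Watcher" seen above.
    t, err := tail.TailFile("/var/log/katib/metrics.log", tail.Config{
        Follow: true,
        ReOpen: true,
        // Poll: true, // polling sidesteps inotify entirely; a possible workaround
    })
    if err != nil {
        fmt.Println("tail failed:", err)
        return
    }
    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}

If inotify exhaustion is indeed the cause, raising the node limit (for example, sysctl fs.inotify.max_user_instances=512) should make the random failures disappear.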
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm also getting this error periodically with the default metrics collector image on x86:
I0513 15:40:02.000547 66 main.go:394] Trial Name: orbit-dlt-g6gm7-6qwczzt9
2022/05/13 15:40:02 FATAL -- failed to create Watcher
goroutine 18 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/hpcloud/tail/util.Fatal({0xd083fb?, 0x0?}, {0x0, 0x0, 0x0})
/go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc000132040)
/go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
created by github.com/hpcloud/tail/watch.glob..func1
/go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x16e
This was with the StdOut collector, though it looks like I can also replicate it with the File metrics collector. The spec used was:
metricsCollectorSpec:
  collector:
    kind: StdOut
If it matters, this is running on MicroK8s on my laptop.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry for the late reply. Are you still experiencing this issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
@andreyvelich could this be re-opened? I also hit this with docker.io/kubeflowkatib/file-metrics-collector:v0.16.0 at random intervals.