katib
Use Kubeflow metadata for metrics collection
/kind feature
Describe the solution you'd like
Right now Katib depends on logging the metrics to stdout (see #685).
It would be nice if instead Katib could be configured to use Kubeflow metadata to obtain the metrics.
Here's a strawman for how this might work:
- User adds a logging statement to their code to log metrics to metadata with an appropriate set of labels (e.g. experiment & trial)
- Katib uses a selector to match trials to metrics in metadata
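To make the strawman concrete, here is a minimal sketch of label-based logging and selection. It does not use the Kubeflow metadata SDK; `InMemoryMetadataStore`, `log_metric`, and `select` are hypothetical names standing in for whatever the real store would expose, and the label keys `experiment` and `trial` are assumptions taken from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetricRecord:
    """One logged metric value plus the labels used to find it later."""
    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata store (not the Kubeflow metadata API)."""

    def __init__(self) -> None:
        self._records: List[MetricRecord] = []

    def log_metric(self, name: str, value: float, labels: Dict[str, str]) -> None:
        # Copy the labels so later changes by the caller do not affect the record.
        self._records.append(MetricRecord(name, value, dict(labels)))

    def select(self, selector: Dict[str, str]) -> List[MetricRecord]:
        # A record matches when every selector key/value pair appears in its labels.
        return [
            r for r in self._records
            if all(r.labels.get(k) == v for k, v in selector.items())
        ]


store = InMemoryMetadataStore()

# Training code side: tag each metric with experiment and trial labels.
store.log_metric("accuracy", 0.91, {"experiment": "exp-1", "trial": "trial-a"})
store.log_metric("accuracy", 0.87, {"experiment": "exp-1", "trial": "trial-b"})

# Katib side: select only the metrics that belong to one trial.
trial_a_metrics = store.select({"experiment": "exp-1", "trial": "trial-a"})
```

The selector here is plain equality matching on every key, which is probably the simplest scheme that lets Katib map a trial to its metrics without parsing logs.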
It seems natural for folks to instrument their code to log metrics to metadata.
Furthermore, using the metadata SDK to log metrics should mean logging metrics to metadata is no more difficult than logging to stdout.
A side benefit would be that this avoids some of the side effects of using sidecars to fetch logs from stdout (#685):
- Sidecars make it more difficult to determine when a job is completed.
- Logging to metadata makes it easier to write robust code to ensure that metrics are logged
- Training code gets an ACK from the metadata store and can retry in the event of failure
- In contrast, if we rely on training code printing to stdout and the output being collected asynchronously, the training code has no way of knowing whether metrics have been successfully preserved.
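The ACK-and-retry point above could look roughly like the sketch below. Everything here is illustrative: `log_with_retry` and `FlakyStore` are hypothetical helpers, and the assumption that a failed write surfaces as a `ConnectionError` or a falsy return value is mine, not something the metadata store is known to do.

```python
import time
from typing import Callable, List, Tuple


def log_with_retry(
    write: Callable[[], bool],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> bool:
    """Call `write` until it reports success (an ACK) or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            if write():
                return True
        except ConnectionError:
            pass  # treat a transport error the same as a missing ACK
        if attempt < max_attempts:
            # Linear backoff between attempts; 0.0 disables the wait.
            time.sleep(backoff_seconds * attempt)
    return False


class FlakyStore:
    """Hypothetical store that rejects the first `failures` writes."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.saved: List[Tuple[str, float]] = []

    def write(self, name: str, value: float) -> bool:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("store unavailable")
        self.saved.append((name, value))
        return True


flaky = FlakyStore(failures=2)
acked = log_with_retry(lambda: flaky.write("loss", 0.12))
```

With stdout there is nothing to return `True` or `False`, so the training code cannot make this kind of decision at all.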
/cc @zhenghuiwang @johnugeorge @gaocegege
@jlewi @zhenghuiwang In fact, all metrics are already persisted in the Katib DB (for now we only implement a MySQL driver), and we could implement a new DB driver for Kubeflow metadata, just like the MySQL counterpart.
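As a rough illustration of the driver idea, the sketch below puts metric storage behind one interface so that a second backend can be swapped in. The names `MetricsBackend`, `DictBackend`, `report`, `get`, and `best_value` are hypothetical; Katib's actual DB interface is not shown in this thread and may look quite different.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class MetricsBackend(ABC):
    """Hypothetical storage interface that each driver would implement."""

    @abstractmethod
    def report(self, trial: str, metric: str, value: float) -> None:
        """Persist one metric observation for a trial."""

    @abstractmethod
    def get(self, trial: str) -> List[Tuple[str, float]]:
        """Return all (metric, value) observations recorded for a trial."""


class DictBackend(MetricsBackend):
    """In-memory stand-in for a concrete driver (MySQL, metadata, ...)."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[Tuple[str, float]]] = {}

    def report(self, trial: str, metric: str, value: float) -> None:
        self._rows.setdefault(trial, []).append((metric, value))

    def get(self, trial: str) -> List[Tuple[str, float]]:
        return list(self._rows.get(trial, []))


def best_value(backend: MetricsBackend, trial: str, metric: str) -> float:
    """Caller code depends only on the interface, not on a specific driver."""
    return max(v for m, v in backend.get(trial) if m == metric)


backend = DictBackend()
backend.report("trial-a", "accuracy", 0.80)
backend.report("trial-a", "accuracy", 0.91)
backend.report("trial-a", "loss", 0.30)
best = best_value(backend, "trial-a", "accuracy")
```

If the metadata store can satisfy the same two operations, adding it would mean writing one more class alongside the existing driver rather than changing the callers.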
Out of the box integration with metadata would be awesome.
I'm not sure about the requirements of metadata. Right now we only use katib-db to store metrics. If metadata does not require any other abstraction, I think it should be easy to support.
Related: https://github.com/kubeflow/katib/issues/841#issuecomment-537413455
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen