katib Use Kubeflow metadata for metrics collection

/kind feature

Describe the solution you'd like

Right now Katib depends on logging the metrics to stdout (see #685).

It would be nice if instead Katib could be configured to use Kubeflow metadata to obtain the metrics.

Here's a strawman for how this might work:

- The user adds logging statements to their code to log metrics to metadata with an appropriate set of labels (e.g. experiment and trial).
- Katib uses a selector to match trials to metrics in metadata.
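As a rough illustration of the strawman, here is a minimal sketch of label-based logging and selection. `InMemoryMetadataStore`, `log_metric`, and `select` are hypothetical names invented for this example; they are not the Kubeflow metadata SDK or any Katib API, and a real integration would presumably call the metadata service instead of a Python list.

```python
from dataclasses import dataclass


@dataclass
class MetricRecord:
    name: str
    value: float
    labels: dict


class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata store (not the Kubeflow metadata API)."""

    def __init__(self):
        self._records = []

    def log_metric(self, name, value, labels):
        # Training code would call this with labels identifying the experiment and trial.
        self._records.append(MetricRecord(name, value, dict(labels)))

    def select(self, selector):
        # Return records whose labels contain every key/value pair in the selector.
        return [
            r for r in self._records
            if all(r.labels.get(k) == v for k, v in selector.items())
        ]


store = InMemoryMetadataStore()

# Training side: log metrics tagged with experiment and trial labels.
store.log_metric("accuracy", 0.91, {"experiment": "exp-1", "trial": "trial-a"})
store.log_metric("accuracy", 0.87, {"experiment": "exp-1", "trial": "trial-b"})

# Katib side: use a selector to find the metrics that belong to one trial.
trial_a_metrics = store.select({"experiment": "exp-1", "trial": "trial-a"})
```

The selector here is a plain equality match on labels, similar in spirit to a Kubernetes label selector; whether metadata would support that kind of query directly is an open question.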

It seems natural for folks to instrument their code to log metrics to metadata.

Furthermore, using the metadata SDK to log metrics should mean that logging metrics to metadata is no more difficult than logging to stdout.

A side benefit would be that this avoids some of the side effects of using sidecars to fetch logs from stdout (#685):

- Sidecars make it more difficult to determine when a job is completed.
- Logging to metadata makes it easier to write robust code that ensures metrics are logged:
  - Training code gets an ACK from the metadata store and can retry in the event of failure.
  - In contrast, if we rely on training code printing to stdout and being collected asynchronously, the training code has no way of knowing whether metrics have been successfully preserved.
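To make the ACK-and-retry point concrete, here is a small sketch. Everything in it is an assumption made for illustration: `log_with_retry`, `TransientStoreError`, and `FlakyStore` are hypothetical, and the failing store is simulated rather than a real metadata client.

```python
import time


class TransientStoreError(Exception):
    """Hypothetical error raised when a write is not acknowledged."""


def log_with_retry(write_fn, metric, max_attempts=3, backoff_seconds=0.0):
    """Call write_fn(metric) until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn(metric)  # returning normally is treated as the ACK
            return attempt
        except TransientStoreError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the training code
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


class FlakyStore:
    """Simulated store that fails a fixed number of times before accepting writes."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.saved = []

    def write(self, metric):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TransientStoreError("simulated failure")
        self.saved.append(metric)


flaky = FlakyStore(failures_before_success=2)
attempts_used = log_with_retry(flaky.write, {"name": "loss", "value": 0.12})
```

With stdout collection there is no equivalent of the `except` branch: a `print` returns successfully whether or not the sidecar ever picks up the line.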

/cc @zhenghuiwang @johnugeorge @gaocegege

Oct 08 '19 15:10 jlewi

@jlewi @zhenghuiwang In fact, all metrics are already persisted in the Katib DB (for now only a MySQL driver is implemented), and we could implement a new DB driver for Kubeflow metadata, just like the MySQL counterpart.

Oct 09 '19 05:10 hougangliu

Out of the box integration with metadata would be awesome.

Oct 09 '19 06:10 jlewi

I'm not sure what the requirements of metadata are. Right now we only use katib-db to store metrics. If metadata does not require any other abstraction, I think it should be easy to support it.

Oct 09 '19 06:10 gaocegege

Related: https://github.com/kubeflow/katib/issues/841#issuecomment-537413455

Oct 09 '19 06:10 johnugeorge

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Nov 25 '20 00:11 stale[bot]

/lifecycle frozen

Nov 25 '20 03:11 andreyvelich
