Yuan Tang
cc @Jeffwan @gaocegege @johnugeorge @ywskycn @merlintang WDYT? Any objections to bringing this type of error to the Kubeflow CR level? It would be convenient to surface this at CR level status...
Let's use https://github.com/kubeflow/training-operator/issues/1507 to track and discuss separately.
Sounds great to me. This would be a good way to standardize metrics collection. We could also expose some utility methods that operators can use to collect operator-specific custom metrics,...
Hi all, I added a detailed outline of the Prometheus metrics we plan to cover in the common operator in https://github.com/kubeflow/common/pull/77. Please take a look; any feedback would be appreciated.
Agreed. Having a unified interface would make it easier for downstream apps to consume the logs.
cc @ShuhanYan @carmark in case you are interested
Progress is being tracked in individual repos:
- MXNet Operator https://github.com/kubeflow/mxnet-operator/issues/66
- MPI Operator https://github.com/kubeflow/mpi-operator/issues/217
- XGBoost Operator https://github.com/kubeflow/xgboost-operator/issues/44
Also, here are some good references on criteria, processes, and past exits: https://github.com/kubeflow/community/blob/master/guidelines/application_requirements.md#reference