katib
Dedicated logs tab for Trials
/kind feature
Describe the solution you'd like As part of https://github.com/kubeflow/katib/issues/1763 and https://github.com/kubeflow/katib/issues/1745, let's use this issue to discuss how to expose the logs for a Trial.
Looking at the docs, there are many different types of workers for a Trial https://www.kubeflow.org/docs/components/katib/trial-template/#custom-resource. The K8s clients allow the backend to fetch logs from a Pod. Given this, the main question I have is: how can the backend find the Pod for a specific Trial worker type?
- Will there always be an annotation/label that reflects the Trial that the Pod belongs to?
- Is this the case for all worker types?
- What could we show when we have an Argo Workflow?
I suggest that we handle each worker type separately. We can start with K8s Jobs and/or TFJobs and start adding more later on.
@d-gol @andreyvelich @johnugeorge @gaocegege
Love this feature? Give it a 👍 We prioritize the features with the most 👍
@kimwnasptd thank you for creating this issue and organizing the effort!
Let me try to answer your questions to the best of my knowledge:
Will there always be an annotation/label that reflects the Trial that the Pod belongs to?
Yes. There is a label job-name in every pod which belongs to a specific trial - https://github.com/kubeflow/common/blob/v0.4.1/pkg/apis/common/v1/interface.go#L30
job-name is equal to the trial-name - https://github.com/kubeflow/katib/blob/release-0.13/pkg/webhook/v1beta1/pod/inject_webhook.go#L120
In the case of an MPI job, the label is mpi-job-name - https://github.com/kubeflow/training-operator/blob/v1.4-branch/pkg/controller.v1/mpi/mpijob.go#L46
This allows us to obtain all pods belonging to various Trial types (Job, TFJob, PyTorchJob...)
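To make the label lookup concrete, here is a minimal sketch of how a label selector for a Trial's pods could be built from the labels mentioned above. The function name `trialPodSelector` and the worker kind strings are hypothetical, and it assumes the label value equals the Trial name for every worker kind, including MPIJob.

```go
package main

import "fmt"

// trialPodSelector returns a Kubernetes label selector string intended to
// match the pods of a Trial. Hypothetical helper, not part of Katib.
// Assumption: the label value equals the Trial name for every worker kind.
func trialPodSelector(workerKind, trialName string) string {
	// Most worker kinds (Job, TFJob, PyTorchJob, ...) use the job-name label.
	labelKey := "job-name"
	// Per the thread, MPI jobs use a different label key.
	if workerKind == "MPIJob" {
		labelKey = "mpi-job-name"
	}
	return fmt.Sprintf("%s=%s", labelKey, trialName)
}

func main() {
	fmt.Println(trialPodSelector("TFJob", "my-trial"))
	fmt.Println(trialPodSelector("MPIJob", "my-trial"))
}
```

The resulting string could be passed as the label selector when listing pods in the Trial's namespace.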
Is this the case for all worker types?
Yes for Job, TFJob, PyTorchJob, MXJob, XGBoostJob and MPIJob.
For Pipelines, I would expect the same, but I would first focus on getting the logs from the other CRDs since Pipeline logs can already be accessed on the Run page.
What could we show when we have an Argo Workflow?
We could show a link (button) that would route to the Pipelines Run page with logs, as in the trials table - https://github.com/kubeflow/katib/blob/release-0.13/pkg/new-ui/v1beta1/frontend/src/app/pages/experiment-details/trials-table/trials-table.component.html#L34 Later, we could technically parse responses from the pipeline Run page and show them in the Katib UI as well. For now, since the logs already exist elsewhere, I think that a link would be enough.
Please correct me if I'm wrong and share your ideas.
I would like to work on this, if no one else is working on it right now!
Thanks @elenzio9
/assign @elenzio9
Fixes: #971 as well
We realized with @kimwnasptd that we need to extend the backend by adding a new route for the LOGS tab which:
- Will get a Trial name/namespace
- Fetch the underlying Pod name based on the job-name label
- Return the logs of that Pod
Also, we saw that if the Trial does not have retain: true then the underlying CRs will not be persisted and thus there won't be any Pods to gather logs from.
We can make the frontend show a message for this though, to help users understand why they don't see logs for a Trial.
Unfortunately I'm not very familiar with Golang and the backend, so I can't help much there. But would really like to help with the frontend work once we have such a route for the Trial logs!
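The route described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `/trial_logs` path, the `trialName` and `namespace` query parameters, and the `podLogSource` interface are all hypothetical, and a real backend would implement the interface with the Kubernetes client rather than the in-memory fake used here. When no pod is found, the handler returns a message hinting at retain: true, which the frontend could display.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// podLogSource is a hypothetical abstraction over the Kubernetes client.
type podLogSource interface {
	FindPod(namespace, selector string) (string, error) // pod name for a label selector
	Logs(namespace, pod string) (string, error)         // log text of a pod
}

var errNoPod = errors.New("no pod found")

// trialLogsHandler takes a Trial name/namespace, finds the pod through the
// job-name label, and returns that pod's logs.
func trialLogsHandler(src podLogSource) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		trial := r.URL.Query().Get("trialName")
		ns := r.URL.Query().Get("namespace")
		if trial == "" || ns == "" {
			http.Error(w, "trialName and namespace are required", http.StatusBadRequest)
			return
		}
		pod, err := src.FindPod(ns, "job-name="+trial)
		if errors.Is(err, errNoPod) {
			// Likely cause per the thread: the Trial was not retained.
			http.Error(w, "no pods found; logs need retain: true on the Trial", http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		logs, err := src.Logs(ns, pod)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, logs)
	}
}

// fakeSource is an in-memory stand-in used only for this example.
type fakeSource struct {
	pods map[string]string // "namespace/selector" -> pod name
	logs map[string]string // "namespace/pod" -> log text
}

func (f fakeSource) FindPod(ns, sel string) (string, error) {
	pod, ok := f.pods[ns+"/"+sel]
	if !ok {
		return "", errNoPod
	}
	return pod, nil
}

func (f fakeSource) Logs(ns, pod string) (string, error) {
	return f.logs[ns+"/"+pod], nil
}

// get performs an in-process GET request against the handler.
func get(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	src := fakeSource{
		pods: map[string]string{"kubeflow/job-name=trial-a": "trial-a-xyz"},
		logs: map[string]string{"kubeflow/trial-a-xyz": "loss=0.1\n"},
	}
	code, body := get(trialLogsHandler(src), "/trial_logs?trialName=trial-a&namespace=kubeflow")
	fmt.Println(code, body)
}
```

Keeping the Kubernetes access behind a small interface is a design choice for the sketch: it lets the handler logic be exercised without a cluster.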
Hi @elenzio9, thank you for your efforts with this!
I did start some work on it a while ago, but didn't finish. Here you can see the functions for extracting logs from all pods from a specific trial: https://github.com/d-gol/katib/blob/64ac7034d81faeeb7a554417f47a0c7c445c3d72/pkg/new-ui/v1beta1/backend.go#L421 and https://github.com/d-gol/katib/blob/cf7106dfae69e66f39582f1ef981ed65a208732b/pkg/util/v1beta1/katibclient/katib_client.go#L230
The reason it's checking multiple pods is that the trial can also be a TFJob, PyTorchJob, or any other training operator CR. So we need to fetch logs from each worker (pod), and also find a way to show them nicely in the UI.
Is this something that would be useful for you? If so, I can rebase my changes and take care of the backend. We can even submit separate PRs later. If you can take care of the frontend part, that would be amazing!
That would be great @d-gol
Btw, do you need logs from each pod? We just need logs from Master right?
@johnugeorge sure, we can get logs only from the master. I can implement that, submit a PR, and then later if needed we can obtain logs from all workers.
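Restricting the logs to the master could look roughly like the sketch below. It assumes master pods carry a job-role label with the value master; that label is not confirmed anywhere in this thread and should be checked against the training operator version in use. The `pod` type and `pickLogPod` function are hypothetical. Pods without such a label (for example, pods of a plain K8s Job) fall back to the first pod in the list.

```go
package main

import "fmt"

// pod is a minimal stand-in for the fields of a Kubernetes Pod used here.
type pod struct {
	Name   string
	Labels map[string]string
}

// pickLogPod chooses the pod whose logs should be shown for a Trial.
// Assumption: master pods are labeled job-role=master.
func pickLogPod(pods []pod) (pod, bool) {
	for _, p := range pods {
		if p.Labels["job-role"] == "master" {
			return p, true
		}
	}
	// No master label found: fall back to the first pod, if any.
	if len(pods) > 0 {
		return pods[0], true
	}
	return pod{}, false
}

func main() {
	pods := []pod{
		{Name: "trial-a-worker-0", Labels: map[string]string{"job-name": "trial-a"}},
		{Name: "trial-a-master-0", Labels: map[string]string{"job-name": "trial-a", "job-role": "master"}},
	}
	if p, ok := pickLogPod(pods); ok {
		fmt.Println(p.Name)
	}
}
```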
@kimwnasptd @elenzio9 Since the backend changes for Trial logs have been merged: https://github.com/kubeflow/katib/pull/2039, are we going to make changes in the frontend to show the logs? Do we know if we have bandwidth to implement it before the Katib 0.15 release?
@elenzio9 Are you planning this for this release?
@johnugeorge @andreyvelich I'm working on it right now, and I'll send the PR as soon as possible.
This was implemented, thank you for your contributions!