Epic: improve experiment logs
Summary / Background
Provide robust logging for experiment runs.
Scope
When running any experiment, save logs of the output, errors, hardware usage, time, etc. Be able to retrieve them anytime/anywhere for any experiment, including sharing between users and products (DVC, VS Code, Studio).
Assumptions
- Only for pipeline execution (not about dvclive-only experiments)
Open Questions
- How do we share the logs?
- Should we share live log updates to Studio?
Blockers / Dependencies
- Can we make it a joint effort with VS Code and Studio teams? Seems like it would be powerful in Studio for workflows like cloud experiments.
General Approach
We already have dvc queue logs. For sharing, we could add dvc queue push/pull or support dvc push/pull --logs.
Steps
Phase 1: Make logging work for all experiments
- [x] https://github.com/iterative/dvc/issues/9425
- [ ] https://github.com/iterative/dvc/issues/9616
- [ ] https://github.com/iterative/dvc/issues/9174
- [ ] https://github.com/iterative/dvc/issues/8658
- [ ] https://github.com/iterative/dvc/issues/9079
Phase 2: Expand and share logs
- [ ] https://github.com/iterative/dvc/issues/8483
- [ ] Time each stage took to execute
- [ ] Hardware usage and type - number of CPUs/GPUs and their usage, same with memory
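To make the Phase 2 items more concrete, here is a minimal sketch of what capturing output, errors, and per-stage timing could look like. It is not DVC code: `run_stage_with_log`, the `<stage>.log` / `<stage>.json` file layout, and the metadata field names are all hypothetical and only for illustration. It records the CPU count as a stand-in for "hardware type"; sampling CPU/GPU/memory usage over time would need more than the standard library and is left out.

```python
import json
import os
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_stage_with_log(name, cmd, log_dir):
    """Run one stage command, saving its output and basic run metadata.

    Hypothetical helper for illustration only; not part of DVC.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{name}.log"

    start = time.perf_counter()
    # Merge stderr into stdout so errors appear in order with regular output.
    with open(log_path, "w") as log_file:
        proc = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    duration = time.perf_counter() - start

    meta = {
        "stage": name,
        "returncode": proc.returncode,
        "duration_seconds": round(duration, 3),
        # Hardware type only; usage sampling is out of scope for this sketch.
        "cpu_count": os.cpu_count(),
        "log_file": log_path.name,
    }
    # Assumed layout: one JSON metadata file next to each stage log.
    (log_dir / f"{name}.json").write_text(json.dumps(meta, indent=2))
    return meta


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cmd = [sys.executable, "-c", "print('hello')"]
        meta = run_stage_with_log("train", cmd, tmp)
        print(meta["returncode"], (Path(tmp) / "train.log").read_text().strip())
```

Keeping the metadata in a separate small file next to the raw log is one possible design: the timing and hardware info could then be shared or displayed (e.g. in Studio) without transferring the full log.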
Timelines
TBD (not yet prioritized)
As discussed in #9425, the current dvc queue logs command won't make sense if we want to capture logs for non-queued experiments. Now that we have dropped checkpoints, do we still need a separate queue command, or can we merge it with exp?
Looking through the current queue commands:
- start: could be in exp run --run-all or exp start
- stop: could be in exp stop
- status: is it needed? If so, can it be in exp status?
- logs: could be in exp logs
- remove: is it needed? This also might depend on whether/how we plan to preserve the logs; some info could be auto-deleted on exp clean
- kill: could be in exp kill