Add multi file error aggregation strategy
Tracking issue
Part of RFC https://github.com/flyteorg/flyte/pull/5598
Why are the changes needed?
Deterministic error propagation, see: https://github.com/flyteorg/flyte/pull/5598
What changes were proposed in this pull request?
An error aggregation strategy was introduced for the remove file output reader.
This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.
My plan was the following in terms of order:
- Get this PR in, start using the new error retriever for PyTorch
- Get the entry point changes in, for creating multiple error files when environment variables are present
How was this patch tested?
Unit testing.
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [X] All new and existing tests passed.
- [X] All commits are signed-off.
Related PRs
- https://github.com/flyteorg/flyte/pull/5741
- https://github.com/flyteorg/flyte/pull/5616
- https://github.com/flyteorg/flytekit/pull/2797
Codecov Report
Attention: Patch coverage is 60.00000% with 90 lines in your changes missing coverage. Please review.
Project coverage is 36.85%. Comparing base (
636cc23) to head (d4846f4). Report is 131 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #5795 +/- ##
==========================================
+ Coverage 36.81% 36.85% +0.04%
==========================================
Files 1310 1310
Lines 131034 131217 +183
==========================================
+ Hits 48237 48361 +124
- Misses 78611 78658 +47
- Partials 4186 4198 +12
| Flag | Coverage Δ | |
|---|---|---|
| unittests-datacatalog | 51.58% <ø> (ø) |
|
| unittests-flyteadmin | 54.02% <ø> (-0.03%) |
:arrow_down: |
| unittests-flytecopilot | 11.73% <ø> (ø) |
|
| unittests-flytectl | 62.44% <ø> (+0.04%) |
:arrow_up: |
| unittests-flyteidl | 6.92% <0.00%> (-0.01%) |
:arrow_down: |
| unittests-flyteplugins | 53.86% <65.62%> (+0.22%) |
:arrow_up: |
| unittests-flytepropeller | 42.90% <50.00%> (-0.01%) |
:arrow_down: |
| unittests-flytestdlib | 55.41% <85.71%> (-0.01%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
cc @eapolinario @wild-endeavor
Thank you for working on this!
Few points/questions...
- This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.
- Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.
- How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?
What frontend UI changes are needed to make the end to end story work if any?
looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.
1/ This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.
My plan was the following in terms of order:
Get this PR in, start using the new error retriever for PyTorch
- Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
- Populate the environment variables for the end to end function (at this point, I'll validate with manual tests)
2/ The IDL changes are low risk IMO. I don't mind merging once it works end to end, but need 2 more PRs for that. How do we mark them as WIP?
3/ Re 'How will this pattern be extended into Agent plugins?' I am not familiar with Agent plugins. Any pointers?
cc https://github.com/flyteorg/flytekit/pull/2797 which is the next step in the process.
@wild-endeavor
- How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?
The RFC (not merged) currently foresees this to be applicable for K8s plugins in flyteplugins. In my mind, if an agent plugin creates multiple "workers in some external system", it should aggregate the errors of those workers*. But I admittedly know too little about agents at this point. If you disagree, let's discuss this in the RFC/in the next contrib sync?
*Some plugins in flyteplugins can't do that because they manage CRDs (e.g. PyTorchJob) and not the underlying pods and the respective controllers for e.g. PyTorchJob don't do any error aggregation across pods.
What frontend UI changes are needed to make the end to end story work if any?
The RFC says this:
With these changes, flyteadmin's
/api/v1/executions/<project>/<domain>/<execution id>endpoint, which today provides the error message to the UI, then also provides the information which worker experienced the root cause error.flyteconsoleneeds to be modified to show this information.