flyte Add multi file error aggregation strategy

Tracking issue

Part of RFC https://github.com/flyteorg/flyte/pull/5598

Why are the changes needed?

Deterministic error propagation, see: https://github.com/flyteorg/flyte/pull/5598

What changes were proposed in this pull request?

An error aggregation strategy was introduced for the remove file output reader.

This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.

My plan was the following in terms of order:

Get this PR in, start using the new error retriever for PyTorch
Get the entry point changes in, for creating multiple error files when environment variables are present

How was this patch tested?

Unit testing.

Check all the applicable boxes

[ ] I updated the documentation accordingly.
[X] All new and existing tests passed.
[X] All commits are signed-off.

Related PRs

https://github.com/flyteorg/flyte/pull/5741
https://github.com/flyteorg/flyte/pull/5616
https://github.com/flyteorg/flytekit/pull/2797

Oct 01 '24 21:10 bgedik

Codecov Report

Attention: Patch coverage is 60.00000% with 90 lines in your changes missing coverage. Please review.

Project coverage is 36.85%. Comparing base (636cc23) to head (d4846f4). Report is 131 commits behind head on master.

Files with missing lines	Patch %	Lines
...uginmachinery/ioutils/remote_file_output_reader.go	63.80%	47 Missing and 12 partials :warning:
flyteidl/gen/pb-go/flyteidl/core/errors.pb.go	0.00%	10 Missing :warning:
flyteidl/gen/pb-go/flyteidl/core/execution.pb.go	0.00%	10 Missing :warning:
...lyteplugins/go/tasks/pluginmachinery/k8s/plugin.go	50.00%	4 Missing :warning:
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go	85.71%	2 Missing and 1 partial :warning:
...er/pkg/controller/nodes/task/k8s/plugin_manager.go	50.00%	2 Missing and 1 partial :warning:
flytestdlib/storage/stow_store.go	80.00%	1 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5795      +/-   ##
==========================================
+ Coverage   36.81%   36.85%   +0.04%     
==========================================
  Files        1310     1310              
  Lines      131034   131217     +183     
==========================================
+ Hits        48237    48361     +124     
- Misses      78611    78658      +47     
- Partials     4186     4198      +12

Flag	Coverage Δ
unittests-datacatalog	`51.58% <ø> (ø)`
unittests-flyteadmin	`54.02% <ø> (-0.03%)`	:arrow_down:
unittests-flytecopilot	`11.73% <ø> (ø)`
unittests-flytectl	`62.44% <ø> (+0.04%)`	:arrow_up:
unittests-flyteidl	`6.92% <0.00%> (-0.01%)`	:arrow_down:
unittests-flyteplugins	`53.86% <65.62%> (+0.22%)`	:arrow_up:
unittests-flytepropeller	`42.90% <50.00%> (-0.01%)`	:arrow_down:
unittests-flytestdlib	`55.41% <85.71%> (-0.01%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Oct 01 '24 21:10 codecov[bot]

cc @eapolinario @wild-endeavor

Oct 09 '24 19:10 bgedik

Thank you for working on this!

Few points/questions...

This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.

Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.

How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?

What frontend UI changes are needed to make the end to end story work if any?

looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.

1/ This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.

My plan was the following in terms of order:

Get this PR in, start using the new error retriever for PyTorch

Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
Populate the environment variables for the end to end function (at this point, I'll validate with manual tests)

2/ The IDL changes are low risk IMO. I don't mind merging once it works end to end, but need 2 more PRs for that. How do we mark them as WIP?

3/ Re 'How will this pattern be extended into Agent plugins?' I am not familiar with Agent plugins. Any pointers?

Oct 10 '24 05:10 bgedik

cc https://github.com/flyteorg/flytekit/pull/2797 which is the next step in the process.

Oct 11 '24 21:10 wild-endeavor

@wild-endeavor

How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?

The RFC (not merged) currently foresees this to be applicable for K8s plugins in flyteplugins. In my mind, if an agent plugin creates multiple "workers in some external system", it should aggregate the errors of those workers*. But I admittedly know too little about agents at this point. If you disagree, let's discuss this in the RFC/in the next contrib sync?

*Some plugins in flyteplugins can't do that because they manage CRDs (e.g. PyTorchJob) and not the underlying pods and the respective controllers for e.g. PyTorchJob don't do any error aggregation across pods.

What frontend UI changes are needed to make the end to end story work if any?

The RFC says this:

With these changes, flyteadmin's /api/v1/executions/<project>/<domain>/<execution id> endpoint, which today provides the error message to the UI, then also provides the information which worker experienced the root cause error. flyteconsole needs to be modified to show this information.

Oct 17 '24 03:10 fg91