flyte
flyte copied to clipboard
[Core feature] Add Outputs() as the idiom for multiple outputs, to avoid user confusion on NamedTuple-vs-dataclass
Motivation: Why do you think this is important?
(Copying out a feature request from slack, hopefully I captured enough context here)
I now understand the (important) distinction between dataclasses and NamedTuple in flytekit:
- NamedTuple : output :: kwargs : input — this makes a lot of sense, because you want multiple outputs, with names
- NamedTuple cannot be used as a datatype, only as a return type to mean "multiple outputs"
- dataclass is a normal datatype that you can use wherever you like. it doesn't mean "multiple outputs"
What's confusing is the way a python user declares a datatype using @dataclass
or NamedTuple
is basically the same:
@dataclass
class Config:
epochs: int
cv_splits: int
...
class Config(NamedTuple):
epochs: int
cv_splits: int
...
So it's a very easy trap to think they're basically interchangeable and then get confused and/or frustrated when flyte behaves in very different ways depending on which one you used. My team is trying to anticipate adding the rest of our team as flyte users (~8 people), as well as handfuls more teams (~5–10 teams) as users, and we think this is an important friction to get ahead of and have a simple recommendation and happy path for.
Goal: What should the final outcome look like, ideally?
Library pseudocode
- Define this once
- Document/explain to users to use Outputs instead of NamedTuple for task outputs
# Outputs is like NamedTuple except:
# - It fills in the type name for you -- it's a nuisance parameter, and flyte ignores it
# - You use it inline instead of inheriting from it, for both type and value usages
# - TODO Add metaclass stuff to make this code actually work as a type (and a value)
Outputs = lambda **kwargs: NamedTuple("Outputs", **kwargs)
Example user code:
from wherever import Outputs
@dataclass
def Config:
...
@dataclass
def TrainStats:
...
@task
def evaluate_model(
config: Config, # A user-defined dataclass
model: tf.keras.Model, # Some type from a library
metrics: List[str], # A normal python datatype
) -> Outputs( # Use Outputs() inline as a type instead of declaring a NamedTuple
success: bool, # A normal python datatype
stats: TrainStats, # A user-defined dataclass
thresholds: np.ndarray # Some type from a library
):
...
return Outputs( # Also use Outputs() as a value, matching the type above
success=...,
stats=...,
thresholds=...,
)
# Simple tasks can ofc still return single outputs too
# - With no name, i.e. flyte's default o1 naming
@task
def sample_train_data(X: pd.DataFrame) -> pd.DataFrame:
...
Describe alternatives you've considered
.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
Thank you for opening your first issue here! 🛠
Didn't get a chance to keep up with the Slack thread in real time. If I'm understanding this, the issues are:
- When users write tasks and workflows, oftentimes it's useful to be able to think of inputs/outputs as a group.
- The current documentation doesn't make it clear that flytekit uses NamedTuples to resolve the "output-naming" problem, not the "output-bundling" problem. Lack of guidance here leads users astray.
- The documentation in general is not clear about what dataclasses are, how they map to a Flyte entity's interface, when to use them, why they need to be dataclass_json, why these are fundamentally different from NamedTuples.
The proposal is to create a new construct that will make it obvious to the user what is happening. Can I suggest the following?
- Is the best name for this construct
Outputs
? Will that be confusing because users will return this from a task or workflow but then use it as in input to a subsequent task. Something likeFlyteIOGroup
orNamedIO
? Happy to use whatever you guys decide. - Can we keep the same usage pattern as NamedTuples? That is, using them will be a two step process:
- Step 1. Create the type
MyGroup = Outputs(a=int, b=str)
- Step 2. In task or workflow
return MyGroup(3, "hi")
- Step 1. Create the type
- Complete https://github.com/flyteorg/flyte/issues/1337 under the guise of this new construct.
- Add support alongside the existing NamedTuple support for this new type.
- Paraphrase the slack thread into documentation.
@wild-endeavor Wait, I think you're missing a key high-level idea here:
- I'm proposing that NamedTuples/Outputs should only be used for "multiple/named outputs", and not for anything about inputs
- For inputs, we already have python kwargs for "multiple/named inputs"
- And for structs/datatypes (arbitrarily nested) in inputs and outputs, we already have dataclasses
So as a consequence of that:
Complete https://github.com/flyteorg/flyte/issues/1337 under the guise of this new construct.
- I'd propose that https://github.com/flyteorg/flyte/issues/1337 should not be supported, since that ask is stumbling over exactly the confusion about NamedTuples-vs-dataclasses that motivated my thinking above
- My proposal is that flyte should make a special thing called
Outputs
(orNamedOutputs
, etc.) in place ofNamedTuple
, and otherwise discourage the use ofNamedTuple
entirely - Hopefully my example above illustrates this
Is the best name for this construct
Outputs
? Will that be confusing because users will return this from a task or workflow but then use it as in input to a subsequent task. Something like FlyteIOGroup or NamedIO? Happy to use whatever you guys decide.
- Yes, I think
Outputs
(or maybeNamedOutputs
) is an appropriate name because this is intended only for outputs, not inputs- That is, an
Outputs
should always be unpacked and then its components should be passed around - In cases where returning and unpacking an
Outputs
isn't helpful, the user should just return a single datatype (e.g. a python primitive, a dataclass, etc.) and pass that into downstream tasks directly
- That is, an
- Further, to avoid user confusion I'd propose that flyte tasks/workflows raise an error if
Outputs
is seen as an input, since that means the user made a mistake or is confused about usage
Can we keep the same usage pattern as NamedTuples? That is, using them will be a two step process:
- Step 1. Create the type MyGroup = Outputs(a=int, b=str)
- Step 2. In task or workflow return MyGroup(3, "hi")
- I think having to name the
Outputs(a=int, b=str)
for every task/workflow is toilsome, andOutputs
should support an inline usage like in my example above - Having to name the
Outputs(...)
type for every task would I think just result in a boilerplate pattern ofFooOutputs = Outputs(...)
anddef foo(): FooOutputs
, which would add zero clarity at the cost of non-zero indirection and separate top-level definitions
Add support alongside the existing NamedTuple support for this new type.
- Overlapping with above, discourage and/or disallow NamedTuple, to avoid user confusion
wdyt?
Can we keep the same usage pattern as NamedTuples? That is, using them will be a two step process:
- Step 1. Create the type MyGroup = Outputs(a=int, b=str)
- Step 2. In task or workflow return MyGroup(3, "hi")
- I think having to name the Outputs(a=int, b=str) for every task/workflow is toilsome, and Outputs should support an inline usage like in my example above
- Having to name the Outputs(...) type for every task would I think just result in a boilerplate pattern of FooOutputs = Outputs(...) and def foo(): FooOutputs, which would add zero clarity at the cost of non-zero indirection and separate top-level definitions
Important caveat: I have no idea if the inline usage of e.g. Outputs(a=int, b=str)
as a type is actually possible in python. :) If that doesn't work, then 👍 I guess the two-definition style is the only option.
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏
@jdanbrown you don't care that Outputs
is a NamedTuple right? As long as it behaves in the way that you described that will be okay?
We're looking at adding a top level outputs object to the IDL. I think it's the right time to revisit this.
@wild-endeavor Yeah! Don't care if Outputs
is a NamedTuple
or whatever else, just the described behavior.