flyte icon indicating copy to clipboard operation
flyte copied to clipboard

[Core feature] Add Outputs() as the idiom for multiple outputs, to avoid user confusion on NamedTuple-vs-dataclass

Open jdanbrown opened this issue 2 years ago • 4 comments

Motivation: Why do you think this is important?

(Copying out a feature request from slack, hopefully I captured enough context here)

I now understand the (important) distinction between dataclasses and NamedTuple in flytekit:

  • NamedTuple : output :: kwargs : input — this makes a lot of sense, because you want multiple outputs, with names
  • NamedTuple cannot be used as a datatype, only as a return type to mean "multiple outputs"
  • dataclass is a normal datatype that you can use wherever you like. it doesn't mean "multiple outputs"

What's confusing is the way a python user declares a datatype using @dataclass or NamedTuple is basically the same:

@dataclass
class Config:
    epochs: int
    cv_splits: int
    ...

class Config(NamedTuple):
    epochs: int
    cv_splits: int
    ...

So it's a very easy trap to think they're basically interchangeable and then get confused and/or frustrated when flyte behaves in very different ways depending on which one you used. My team is trying to anticipate adding the rest of our team as flyte users (~8 people), as well as handfuls more teams (~5–10 teams) as users, and we think this is an important friction to get ahead of and have a simple recommendation and happy path for.

Goal: What should the final outcome look like, ideally?

Library pseudocode

  • Define this once
  • Document/explain to users to use Outputs instead of NamedTuple for task outputs
# Outputs is like NamedTuple except:
#   - It fills in the type name for you -- it's a nuisance parameter, and flyte ignores it
#   - You use it inline instead of inheriting from it, for both type and value usages
#   - TODO Add metaclass stuff to make this code actually work as a type (and a value)
Outputs = lambda **kwargs: NamedTuple("Outputs", **kwargs)

Example user code:

from wherever import Outputs

@dataclass
def Config:
    ...

@dataclass
def TrainStats:
    ...

@task
def evaluate_model(
    config: Config,         # A user-defined dataclass
    model: tf.keras.Model,  # Some type from a library
    metrics: List[str],     # A normal python datatype
) -> Outputs(               # Use Outputs() inline as a type instead of declaring a NamedTuple
    success: bool,          # A normal python datatype
    stats: TrainStats,      # A user-defined dataclass
    thresholds: np.ndarray  # Some type from a library
):
    ...
    return Outputs(         # Also use Outputs() as a value, matching the type above
        success=...,
        stats=...,
        thresholds=...,
    )

# Simple tasks can ofc still return single outputs too
#   - With no name, i.e. flyte's default o1 naming
@task
def sample_train_data(X: pd.DataFrame) -> pd.DataFrame:
    ...

Describe alternatives you've considered

.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

jdanbrown avatar Jul 26 '22 17:07 jdanbrown

Thank you for opening your first issue here! 🛠

welcome[bot] avatar Jul 26 '22 17:07 welcome[bot]

Didn't get a chance to keep up with the Slack thread in real time. If I'm understanding this, the issues are:

  • When users write tasks and workflows, oftentimes it's useful to be able to think of inputs/outputs as a group.
  • The current documentation doesn't make it clear that flytekit uses NamedTuples to resolve the "output-naming" problem, not the "output-bundling" problem. Lack of guidance here leads users astray.
  • The documentation in general is not clear about what dataclasses are, how they map to a Flyte entity's interface, when to use them, why they need to be dataclass_json, why these are fundamentally different from NamedTuples.

The proposal is to create a new construct that will make it obvious to the user what is happening. Can I suggest the following?

  • Is the best name for this construct Outputs? Will that be confusing because users will return this from a task or workflow but then use it as in input to a subsequent task. Something like FlyteIOGroup or NamedIO? Happy to use whatever you guys decide.
  • Can we keep the same usage pattern as NamedTuples? That is, using them will be a two step process:
    • Step 1. Create the type MyGroup = Outputs(a=int, b=str)
    • Step 2. In task or workflow return MyGroup(3, "hi")
  • Complete https://github.com/flyteorg/flyte/issues/1337 under the guise of this new construct.
  • Add support alongside the existing NamedTuple support for this new type.
  • Paraphrase the slack thread into documentation.

wild-endeavor avatar Jul 27 '22 21:07 wild-endeavor

@wild-endeavor Wait, I think you're missing a key high-level idea here:

  • I'm proposing that NamedTuples/Outputs should only be used for "multiple/named outputs", and not for anything about inputs
    • For inputs, we already have python kwargs for "multiple/named inputs"
    • And for structs/datatypes (arbitrarily nested) in inputs and outputs, we already have dataclasses

So as a consequence of that:

Complete https://github.com/flyteorg/flyte/issues/1337 under the guise of this new construct.

  • I'd propose that https://github.com/flyteorg/flyte/issues/1337 should not be supported, since that ask is stumbling over exactly the confusion about NamedTuples-vs-dataclasses that motivated my thinking above
  • My proposal is that flyte should make a special thing called Outputs (or NamedOutputs, etc.) in place of NamedTuple, and otherwise discourage the use of NamedTuple entirely
  • Hopefully my example above illustrates this

Is the best name for this construct Outputs? Will that be confusing because users will return this from a task or workflow but then use it as in input to a subsequent task. Something like FlyteIOGroup or NamedIO? Happy to use whatever you guys decide.

  • Yes, I think Outputs (or maybe NamedOutputs) is an appropriate name because this is intended only for outputs, not inputs
    • That is, an Outputs should always be unpacked and then its components should be passed around
    • In cases where returning and unpacking an Outputs isn't helpful, the user should just return a single datatype (e.g. a python primitive, a dataclass, etc.) and pass that into downstream tasks directly
  • Further, to avoid user confusion I'd propose that flyte tasks/workflows raise an error if Outputs is seen as an input, since that means the user made a mistake or is confused about usage

Can we keep the same usage pattern as NamedTuples? That is, using them will be a two step process:

  • Step 1. Create the type MyGroup = Outputs(a=int, b=str)
  • Step 2. In task or workflow return MyGroup(3, "hi")
  • I think having to name the Outputs(a=int, b=str) for every task/workflow is toilsome, and Outputs should support an inline usage like in my example above
  • Having to name the Outputs(...) type for every task would I think just result in a boilerplate pattern of FooOutputs = Outputs(...) and def foo(): FooOutputs, which would add zero clarity at the cost of non-zero indirection and separate top-level definitions

Add support alongside the existing NamedTuple support for this new type.

  • Overlapping with above, discourage and/or disallow NamedTuple, to avoid user confusion

wdyt?

jdanbrown avatar Aug 02 '22 22:08 jdanbrown

Can we keep the same usage pattern as NamedTuples? That is, using them will be a two step process:

  • Step 1. Create the type MyGroup = Outputs(a=int, b=str)
  • Step 2. In task or workflow return MyGroup(3, "hi")
  • I think having to name the Outputs(a=int, b=str) for every task/workflow is toilsome, and Outputs should support an inline usage like in my example above
  • Having to name the Outputs(...) type for every task would I think just result in a boilerplate pattern of FooOutputs = Outputs(...) and def foo(): FooOutputs, which would add zero clarity at the cost of non-zero indirection and separate top-level definitions

Important caveat: I have no idea if the inline usage of e.g. Outputs(a=int, b=str) as a type is actually possible in python. :) If that doesn't work, then 👍 I guess the two-definition style is the only option.

jdanbrown avatar Aug 02 '22 22:08 jdanbrown

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Oct 21 '23 00:10 github-actions[bot]

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Oct 29 '23 00:10 github-actions[bot]

@jdanbrown you don't care that Outputs is a NamedTuple right? As long as it behaves in the way that you described that will be okay?

We're looking at adding a top level outputs object to the IDL. I think it's the right time to revisit this.

wild-endeavor avatar Oct 29 '23 23:10 wild-endeavor

@wild-endeavor Yeah! Don't care if Outputs is a NamedTuple or whatever else, just the described behavior.

jdanbrown avatar Mar 07 '24 09:03 jdanbrown