
[Core feature] Get data from execution into jupyter

wild-endeavor opened this issue 2 years ago · 5 comments

Problem Statement

Users can currently run tasks and have the Flyte memoization service (datacatalog) cache the results. The issue I personally run into is that it's not easy to use that data afterwards while iterating. Say I have local access to the appropriate S3 buckets, and I run a complicated task or workflow to gather a bunch of data; that result is cached. That's great, but now if I want to download that data locally and use it, how do I do that? One could do it, but it's not easy, and it's certainly not documented.

We think this is important because this exact scenario is common: run something in the cloud to get data, where the resulting dataset fits on a laptop and the user wants to play around with it locally.

Goal

No more than five keystrokes or clicks to get the above, assuming a Jupyter notebook is already open.

Misc

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

wild-endeavor avatar May 12 '22 01:05 wild-endeavor

This would be super useful, even for bigger files in the context of hosted notebook environments (e.g., jupyter hub running on ECS, or sagemaker studio) where the downloads are usually pretty fast (S3 -> ec2/ecs). Taking a stab at thinking this through even though I'm still learning some of the flyte fundamentals.

In a perfect world, I could imagine a button that copies some Python code to your clipboard. You could then paste it into a cell in a notebook (where you have flytekit installed). This code would import helpers from flytekit (or a helper function) that call to_python_value etc. on each Flyte type, bringing the data into memory:

[screenshot: example notebook cell]

I could also imagine an easier version of this where you just get the S3 URLs / the cp command, so you can:

! aws s3 cp blah .

That seems like it would be easier to implement, but also less seamless and less powerful, especially in the case of more complex Flyte types like the pandas dataframe. (Some of the less technical science/statistics folks I work with would love this feature; getting back a parquet file they then have to pd.read_parquet would be unfamiliar, and if the Flyte type gets complex they could load the file incorrectly.)

However, this option means you don't need flytekit installed in your notebook environment, which is a small plus (you just need the AWS CLI). It also works more readily for non-Python notebooks (e.g., R Markdown, if people aren't using R magic in a Jupyter notebook).

Anyway, for either of these, there'd be:

A. the codegen piece (bash or Python; presumably needs to live outside of flytekit, i.e., in flyteadmin? or in the console itself?)
B. the little bit of console work to add the button etc.
C. (maybe) the additions to flytekit to use the Flyte types to deserialize the data using the type transformers.

Need to think more about how part C would work (and learn some more Flyte foundations). Presumably you could include the FlyteIDL LiteralMap for the outputs in the clipboard Python code; this gets passed into the function as a literal string etc., and then you load the right types from flytekit and deserialize?
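Conceptually, part C might look like the following toy sketch. This is not flytekit's actual API (`REGISTRY` and `literal_map_to_values` are hypothetical names, and real literals are protobuf messages, not JSON dicts); it just illustrates the idea of a registry of per-type decoders walking a literal map to produce Python values:

```python
# Toy sketch of "part C": decoding a serialized literal map into Python
# values via per-type decoders. REGISTRY and literal_map_to_values are
# hypothetical names, not flytekit's actual API.
import json

# Registry of decoders keyed by a literal "type" tag.
REGISTRY = {
    "integer": int,
    "float": float,
    "string": str,
    "json_struct": json.loads,  # stand-in for richer types like dataframes
}

def literal_map_to_values(literal_map: dict) -> dict:
    """Decode {name: {"type": ..., "value": ...}} into {name: python_value}."""
    out = {}
    for name, literal in literal_map.items():
        decoder = REGISTRY[literal["type"]]
        out[name] = decoder(literal["value"])
    return out

# Example: a literal map as it might arrive embedded in the clipboard snippet.
outputs = literal_map_to_values({
    "count": {"type": "integer", "value": "42"},
    "meta": {"type": "json_struct", "value": '{"rows": 3}'},
})
print(outputs)  # {'count': 42, 'meta': {'rows': 3}}
```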

CalvinLeather avatar Jul 28 '22 01:07 CalvinLeather

Started prototyping this out today.

The friendly client (flytekit.clients.friendly.SynchronousFlyteClient) has a lot of the things needed for this... I think the codegen piece can just output something like:

node_execution_data = c.get_node_execution_data(
    NodeExecutionIdentifier(
        node_id='NODE_ID',
        execution_id=WorkflowExecutionIdentifier(
            project='PROJECT', domain='DOMAIN', name='NAME')))

and then maybe unpack the node_execution_data.full_outputs.

Working on testing this now; this is my first time using the client directly (i.e., not via the CLI), so I'm working through some configuration issues to test it out.
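The codegen piece itself could be a plain string template that the console/admin fills in per execution. A minimal sketch (`render_fetch_snippet` and the template text are my own illustrative names, mirroring the snippet above):

```python
# Minimal sketch of the codegen step: template out the Python snippet that
# a console button would copy to the clipboard. render_fetch_snippet is a
# hypothetical name, not part of any Flyte component.
SNIPPET_TEMPLATE = """\
from flytekit.models.core.identifier import WorkflowExecutionIdentifier, NodeExecutionIdentifier

node_execution_data = c.get_node_execution_data(NodeExecutionIdentifier(
    node_id={node_id!r},
    execution_id=WorkflowExecutionIdentifier(
        project={project!r}, domain={domain!r}, name={name!r})))
"""

def render_fetch_snippet(project: str, domain: str, name: str, node_id: str) -> str:
    # Fill the template with the identifiers for one node execution.
    return SNIPPET_TEMPLATE.format(
        project=project, domain=domain, name=name, node_id=node_id)

print(render_fetch_snippet("flytesnacks", "development", "abc123", "n0"))
```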

CalvinLeather avatar Aug 08 '22 11:08 CalvinLeather

Okay, I have a working first pass on this running locally in a notebook. I did run into one interesting issue. First, I'll describe the implementation plan as it now stands after some prototyping:

  1. The console gets a new button on the inputs/outputs tab (e.g., "copy python code to clipboard").
  2. FlyteAdmin does some codegen for this, basically just templating out a single import plus a Python call to a new flytekit function: data = console_transfer_data(project='WHATEVER', domain='WHATEVER', execution_id='WHATEVER', node_id='WHATEVER', platform_config={'host': 'WHATEVER'}). I could send raw pb messages instead for the various identifiers and maybe the platform config (not sure if an IDL message exists for that), which could simplify the templating.
  3. This function (console_transfer_data) gets added to flytekit. It: a. creates a SynchronousFlyteClient from the platform_config, b. grabs the node execution data, c. uses TypeEngine.literal_map_to_kwargs to get all data into memory, and then returns it.
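Tying those steps together, console_transfer_data might look roughly like this. This is a sketch only: the client and the decode step are injected stand-ins for SynchronousFlyteClient and TypeEngine.literal_map_to_kwargs, so it runs without a Flyte backend:

```python
# Rough sketch of the proposed console_transfer_data. The client factory and
# decode function are simplified stand-ins (hypothetical) for
# SynchronousFlyteClient and TypeEngine.literal_map_to_kwargs.
def console_transfer_data(project, domain, execution_id, node_id,
                          make_client, decode):
    client = make_client()                        # step a: build the client
    raw = client.get_node_execution_data(         # step b: grab execution data
        project, domain, execution_id, node_id)
    return decode(raw["full_outputs"])            # step c: literals -> Python

# Fake client and decoder just to exercise the flow end to end.
class FakeClient:
    def get_node_execution_data(self, project, domain, execution_id, node_id):
        return {"full_outputs": {"answer": "42"}}

data = console_transfer_data(
    "proj", "dev", "exec123", "n0",
    make_client=FakeClient,
    decode=lambda lits: {k: int(v) for k, v in lits.items()},
)
print(data)  # {'answer': 42}
```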

Okay, the interesting issue:

I do

from flytekit.models.core.identifier import WorkflowExecutionIdentifier, NodeExecutionIdentifier, Identifier, ResourceType
# name is just the workflow execution ID, part of the URL in the flyte console
workflow_execution_id = WorkflowExecutionIdentifier(project='relative-finder', 
                                                   domain='development', 
                                                   name='atrthktqrx6hwl87gd7x')
# The node_id appears a bit harder to get (but you can get it via the network tab).
# Note that the more human-readable version, e.g., the one you can
# set with with_overrides(), doesn't appear to work here (e.g., n0).
node_execution_id = NodeExecutionIdentifier(node_id='fnugq4pi', 
                                            execution_id=workflow_execution_id)

node_execution = c.get_node_execution(node_execution_id)
node_execution_data = c.get_node_execution_data(node_execution_id)

task = c.get_task(Identifier(resource_type=ResourceType.TASK,
                             project='relative-finder',
                             domain='development',
                             name='relative_finder.workflows.backfill_tasks.backfill_top_relatives',
                             version='d84cdd3b09218fe64c875bb6725368d3ddfb6d14'))

literal_map_inputs = node_execution_data.full_inputs
literal_map_outputs = node_execution_data.full_outputs

To grab the literal maps of the inputs and outputs (and this is working nicely for me)

I then use

guessed_types = TypeEngine.guess_python_types(task.closure.compiled_task.template.interface.outputs)

data = TypeEngine.literal_map_to_kwargs(FlyteContextManager.current_context(),
                                        literal_map_outputs,
                                        guessed_types)

To convert this to a dictionary of actual python values.

The funky thing is that if I leave guessed_types as it comes back, I get back:

{'added_top_relatives': StructuredDataset(uri=None, file_format='parquet'),
 'new_top_relatives': StructuredDataset(uri=None, file_format='parquet'),
 'old_top_relatives': StructuredDataset(uri=None, file_format='parquet')}

(note how the uri is set to None).

However, if I forcibly set the type of any StructuredDataset to pandas dataframe like so:

guessed_types_fixed = {k: pd.DataFrame if v==StructuredDataset else v for k, v in guessed_types.items()}

It then works and returns the pandas dataframes in memory.

If I call data['old_top_relatives'].dataframe, I get None back (i.e., the literal map -> StructuredDataset conversion is losing the URIs; it doesn't seem like the StructuredDataset object is just printing incorrectly).

I will investigate a bit more later this week, but let me know if anyone has thoughts on why this is happening. I can share the full notebook if that would help; I just don't want to post it here yet, since (out of paranoia) I'd want to clean up some URLs/paths etc. that I don't want floating around publicly (even if they are for private-subnet resources).

The other thing I have to figure out is how to share the PlatformConfig in a general enough way to work with various security settings (I've been testing in our dev environment with security disabled).

CalvinLeather avatar Aug 08 '22 13:08 CalvinLeather

Hey sorry for the delay! And oops - should've told you about FlyteRemote which is not nearly documented enough.

Buttons/UI/Copying

I love the idea of having buttons that do copying. But I will let @kumare3 and @jsonporter decide if we should do it and what it should look like. Maybe we can have our ux designer chime in a bit. Ideally there'd be a few places where we can copy things.

Execution Level

At the execution level, we should be able to pop up a box with Python (and some other languages in the future) that has something like the following. (The fact that I can't write this snippet without referencing code shows why we should have it, or why it should be made simpler.)

from flytekit.remote.remote import FlyteRemote
r = FlyteRemote(<params as determined by admin>)
e = r.fetch_execution(<params as determined by the relevant execution>)

I think it's FlyteRemote(Config.from_endpoint("playground.hosted.unionai.cloud")), which I think is a bit awkward. There should probably be a helper "endpoint" kwarg on the FlyteRemote class directly.
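The convenience-constructor idea can be sketched like this. These are toy stand-in classes, not flytekit's real FlyteRemote/Config; the point is just collapsing the two-step construction into one call:

```python
# Toy sketch of an "endpoint" convenience on a remote client class.
# Config and FlyteRemote here are illustrative stand-ins, not flytekit's API.
from dataclasses import dataclass

@dataclass
class Config:
    endpoint: str
    insecure: bool = False

    @classmethod
    def from_endpoint(cls, endpoint: str) -> "Config":
        return cls(endpoint=endpoint)

@dataclass
class FlyteRemote:
    config: Config

    @classmethod
    def from_endpoint(cls, endpoint: str) -> "FlyteRemote":
        # Collapses FlyteRemote(Config.from_endpoint(...)) into one call.
        return cls(config=Config.from_endpoint(endpoint))

r = FlyteRemote.from_endpoint("playground.hosted.unionai.cloud")
print(r.config.endpoint)  # playground.hosted.unionai.cloud
```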

I/O Level

FlyteRemote

I/O for nodes is indeed a bit hard to get to, but I'm not sure what's easiest for the end user to deal with. Especially in an execution with lots of nodes, or with nodes with long names, it's probably worth helping the user out somehow. Maybe this is additional code that can be copied, relying on the execution object above being present.

Quick note though: in the past, we found it more of a burden on the user when we tried to "guess" the Python type and extract the value for them. This is why we moved to the LiteralsResolver object, which allows users to decode the objects as they wish. Once you get into objects that don't map one-to-one (like files and dataframes), it makes more sense to let the user specify the Python type.
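The LiteralsResolver idea, letting the caller pick the Python type at access time instead of guessing it for them, can be sketched with a toy class (illustrative only, not flytekit's real implementation):

```python
# Toy LiteralsResolver: holds raw literals and decodes a value only when the
# user asks for it with an explicit Python type. Not flytekit's real class.
class LiteralsResolver:
    def __init__(self, literals: dict):
        self._literals = literals  # name -> raw (string) literal

    def get(self, name: str, as_type):
        # The caller chooses the target type, so literals that could map to
        # several Python types (file vs. dataframe) stay unambiguous.
        return as_type(self._literals[name])

resolver = LiteralsResolver({"count": "42", "ratio": "0.5"})
print(resolver.get("count", int))    # 42
print(resolver.get("ratio", float))  # 0.5
```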

Command line

So Flyte tasks read and write an inputs.pb and an outputs.pb (or, in error cases, an error.pb). I think it's probably too much to offer the user a download of these raw files; they are serialized LiteralMaps, which users shouldn't have to think about. But offering a direct download option for off-loaded types (Files, Directories, Schema, StructuredDataset) makes sense.

Structured Dataset

A StructuredDataset is a wrapper object that does not immediately download when constructed; that only happens when you call .open().all(). The uri field is there primarily for when you are returning a dataframe (like returning the result of a task) but you want to specify exactly where it gets uploaded to. I don't remember if I tested the other direction... using the uri to construct a new object pointing at an existing file. If this doesn't work, we should make it work. Typically, when a StructuredDataset Python object is created from an existing StructuredDataset literal, the literal gets stored on it, which is how the decoder later discovers where the data is.
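The lazy-download behavior can be illustrated with a toy wrapper (not flytekit's real StructuredDataset): construction just records the uri, and only .open().all() triggers a read.

```python
# Toy sketch of a lazily-materialized dataset wrapper: constructing it does
# no I/O; the data is only fetched when .open().all() is called.
# Illustrative only; flytekit's StructuredDataset differs.
class LazyDataset:
    def __init__(self, uri: str, loader):
        self.uri = uri          # where the data lives (e.g., an s3:// path)
        self._loader = loader   # function that actually fetches/decodes
        self.downloads = 0      # counter to show construction is cheap

    def open(self) -> "LazyDataset":
        return self  # a real implementation would pick a decoder here

    def all(self):
        self.downloads += 1
        return self._loader(self.uri)

ds = LazyDataset("s3://bucket/data.parquet", loader=lambda uri: [1, 2, 3])
assert ds.downloads == 0      # nothing fetched yet
rows = ds.open().all()        # the "download" happens here
print(rows, ds.downloads)     # [1, 2, 3] 1
```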

wild-endeavor avatar Aug 09 '22 21:08 wild-endeavor

Forgot to mention the most important bit. Without getting too far off track, a use case I'd really love to see is not only creating a Python object that references any output (especially cached output), but that object being passable to a new execution using FlyteRemote: r.execute(some_wf, inputs={"a": <the output data object>})

wild-endeavor avatar Aug 09 '22 21:08 wild-endeavor

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Aug 30 '23 00:08 github-actions[bot]

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Sep 07 '23 00:09 github-actions[bot]

this is indeed supported now - check out pyflyte fetch

kumare3 avatar Dec 22 '23 20:12 kumare3