(UDF) Simplified multi-input multi-output (ala HuggingFace Datasets, Ray, ..)
Is your feature request related to a problem?
Hi,
I'd like to see a simplified way to work with multiple columns in, multiple columns out.
One of the more pythonic approaches I've seen is to use dict[str, np.ndarray] -> dict[str, np.ndarray] (alternatively dict[str, Any]).
This approach is taken by Ray (map_batches) and HuggingFace Datasets (map)
Why is this important for Deep Learning?
When working with tasks such as Object Detection you need to transform the Bounding Box and Image the same way. Transforming could be done "in parallel", which is cumbersome but possible. It becomes a big problem when augmenting data: augmentation is commonly applied with a probability p, and what is applied is also random (e.g. RandomCrop, RandomRescale, MixUp, ...). This means the augmentation has to be applied in exactly the same way to both BBox and Image. The only way I see to do this now is by building a struct, which is possible but not pythonic.
P.S. It's great that batch_size is already enabled as batched transforms are excellent for certain augmentations, e.g. MixUp.
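To make the request concrete, here is a minimal sketch of the dict-in, dict-out shape described above, in plain Python with no Daft, Ray, or HuggingFace calls. The function name `hflip_batch`, the column names `image` and `bbox`, and the pascal_voc-style `(x_min, y_min, x_max, y_max)` box layout are all assumptions made for illustration. The point is that a single random draw per sample decides the augmentation for both columns.

```python
import random


def hflip_batch(batch, p=0.5, rng=None):
    """Horizontally flip each image and its boxes together with probability p.

    batch is a dict of equal-length lists (hypothetical column names):
      "image": list of images, each a list of pixel rows
      "bbox":  list of box lists, each box (x_min, y_min, x_max, y_max)
    """
    rng = rng or random.Random()
    out = {"image": [], "bbox": []}
    for image, boxes in zip(batch["image"], batch["bbox"]):
        # One random draw per sample, shared by the image and its boxes.
        if rng.random() < p:
            width = len(image[0])
            image = [row[::-1] for row in image]
            boxes = [
                (width - x_max, y_min, width - x_min, y_max)
                for (x_min, y_min, x_max, y_max) in boxes
            ]
        out["image"].append(image)
        out["bbox"].append(boxes)
    return out


batch = {
    "image": [[[1, 2, 3], [4, 5, 6]]],  # one 2x3 single-channel image
    "bbox": [[(0, 0, 1, 2)]],           # one box covering the left column
}
flipped = hflip_batch(batch, p=1.0)
unchanged = hflip_batch(batch, p=0.0)
```

With two separate single-column UDFs, each would make its own random draw, so the image and its boxes could end up flipped independently.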
Describe the solution you'd like
A multi-input, multi-output API for UDFs.
Describe alternatives you've considered
I've thought of using struct but it's not as smooth as the more "pythonic" approach of using dict.
Wondering what your idea is.
Additional Context
import albumentations as A

transforms = A.Compose(
    [
        A.RandomResizedCrop(size=(224, 224), antialias=True),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["category_id"]),
)
transforms(**sample)
is how albumentations is applied, where sample is a dict of values. transforms takes kwargs.
A guide on how to use Albumentations with HF Datasets.
Would you like to implement a fix?
Maybe, if you guide me I could try to get it done during the weekend.
Do you have any thoughts @kevinzwang ?
I think this is a good idea. I'm thinking we could use struct under the hood, but provide some nice abstractions over it to make the udf experience as seamless as possible.
Hi @Lundez, thanks for bringing this up. I have a few questions:
- you should already be able to construct UDFs with multiple inputs by simply adding more arguments to your UDF. Does that work for you?
- it's true that UDFs don't have a great mechanism for outputting multiple values at the moment. Is there an interface that you would like to propose for this? The workaround at the moment that we recommend is returning a struct dtype as a list of dictionaries in your UDF. Then, you can expand the struct fields with col("struct_col.*").
Here's a quick example of doing multi-input multi-output with the things I mentioned above:
>>> import daft
>>> @daft.udf(return_dtype=daft.DataType.struct({
...     "x": daft.DataType.int64(),
...     "y": daft.DataType.int64(),
... }))
... def my_udf(a, b):
...     # simple UDF that just returns the two inputs as a struct column
...     result = []
...     for a_elem, b_elem in zip(a.to_pylist(), b.to_pylist()):
...         result.append({"x": a_elem, "y": b_elem})
...     return result
...
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> # call UDF
>>> df = df.select(my_udf(df["a"], df["b"]).alias("udf_result"))
>>> # unnest struct fields
>>> df = df.select("udf_result.*")
>>> df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
What we could maybe do is also support returning a dictionary of lists instead of a list of dictionaries for struct type columns.
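Until something like that exists, the conversion can be done inside the UDF body. The following is a rough sketch in plain Python; `columns_to_rows` is a hypothetical helper name, not part of Daft. It turns a dictionary of equal-length lists into the list of dictionaries that a struct return type accepts today.

```python
def columns_to_rows(columns):
    """Convert {"x": [1, 2], "y": [3, 4]} into [{"x": 1, "y": 3}, {"x": 2, "y": 4}]."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        # zip() would silently truncate to the shortest column, so fail loudly instead.
        raise ValueError("all columns must have the same length")
    return [
        dict(zip(names, values))
        for values in zip(*(columns[name] for name in names))
    ]


rows = columns_to_rows({"x": [1, 2, 3], "y": [4, 5, 6]})
```

A UDF written in the dict-of-lists style could then end with `return columns_to_rows(result)`.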
Hi,
I know it's technically possible to do right now (as I noted with my comment regarding struct). If it's how you prefer the DX to be I'm fine.
I'm merely suggesting adding another way that feels easier to work with, which could potentially help adoption.
The col("struct.*") syntax was quite cool, though the ".unnest()" approach seems clearer (IMO).
Feel free to close issue if you're happy with the state of today 👍
Ah I see, thanks for the feedback. I do want to get around to improving the ergonomics of UDFs, and I think we'll have some time after the new year to flesh it out. Will keep this issue open for others in the community to voice their thoughts too.
Here's my proposal:
- Add something like an unnest_output parameter in @daft.udf that tells Daft to automatically convert a struct output into columns
- more ways to return struct type arrays (in particular, dict of list)
- a way to configure a UDF to take in an arbitrary number of input columns + something like selectors in Polars to allow users to easily pass in specific sets of columns.
@jaychia do you have any thoughts?
I like this. And regarding selectors from polars, those are exceptional. Great idea to add!
Hi @jaychia @kevinzwang, in our application scenario there is also strong demand for UDFs to support multi-column output. I would like to lead the design and implementation of this work. Would that be okay? If so, I will provide a draft of the interface and technical design as soon as possible (expected this week), and lead the corresponding development work.
I will defer to @kevinzwang on this as I know there is a new UDF design proposal underway
@kevinzwang shall we post a GitHub discussion or similar once we arrive at a good design so that @plotor can perhaps add suggestions onto the discussion?
Yep, we are thinking about this actively and will have something to show yall soon.
@kevinzwang Is there an expected timeline? We hope to use this feature as soon as possible, because in our application scenarios there are still many cases where a UDF needs to output multiple columns. The current method based on returning struct is not very flexible.
In addition, if possible, you could split off some development tasks for me; I would be happy to participate.
I'm aiming to stabilize an API for this by the end of the month, but we may be able to get something to you to try out next week. Would love to have you in the loop and get your early feedback on these APIs!
"The current method based on returning struct is not very flexible."
Could you elaborate on that? How do you currently do multi-column outputs, and what makes it inflexible for you? Also, do you have any suggestions for an API that would work well for your use case?
Additional discussion captured in #4820
@plotor could you please share more about the limitations of returning struct there? Thanks!
@rchowell Sorry for the late reply. I asked our business team, and they think the introduction of unnest has made things much more convenient. Previously, they complained about having to flatten a struct into multiple columns, as shown below. However, there are also complaints about unnest, primarily because it currently flattens all columns in a struct. Is it possible to support flattening only specific columns?
@plotor could you elaborate on your use case for flattening specific columns? Would you still want some of the struct fields to still be preserved in a struct? Additionally, I'm wondering if you'd like to specify the columns to include or columns to exclude.
@kevinzwang I rethought it and realized that using unnest to flatten all columns, combined with select or exclude, could solve the problem. Perhaps directly supporting flattening partial columns at the unnest level would provide a better user experience.
However, back to the question:
- Some of the remaining columns in the struct may be used later, but since they will be used later, I believe they should be flattened initially, so there's no need to retain the remaining columns.
- Excluding some columns during flattening is necessary, especially when the UDF returns a large number of columns.
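As a rough illustration of the semantics being asked for (not Daft's actual unnest implementation), here is a plain-Python sketch over rows represented as dicts. The function name `unnest_fields` and its `include` and `exclude` parameters are hypothetical. Fields that are not selected are dropped rather than kept in a leftover struct, matching the first point above.

```python
def unnest_fields(rows, column, include=None, exclude=()):
    """Flatten selected fields of the struct stored in row[column].

    include: if given, only these fields are flattened.
    exclude: fields to skip. Unselected fields are discarded.
    """
    result = []
    for row in rows:
        # Keep every top-level column except the struct being flattened.
        flat = {key: value for key, value in row.items() if key != column}
        for field, value in row[column].items():
            if include is not None and field not in include:
                continue
            if field in exclude:
                continue
            flat[field] = value
        result.append(flat)
    return result


rows = [
    {"id": 1, "udf_result": {"x": 1, "y": 4, "debug": "a"}},
    {"id": 2, "udf_result": {"x": 2, "y": 5, "debug": "b"}},
]
without_debug = unnest_fields(rows, "udf_result", exclude={"debug"})
only_x = unnest_fields(rows, "udf_result", include={"x"})
```

This is equivalent to flattening everything and then dropping columns; the only difference is that the selection happens in one step.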
Gotcha. Since this isn't blocking functionality and is largely an ergonomics issue, I don't plan on prioritizing a fix at the moment, but we'll keep this in mind as we design APIs to better work with nested types like this. Thanks for the input!