Better handling of cases where a deferred function returns more than one value

Open rcap107 opened this issue 4 months ago • 4 comments

Following up on a question on the skrub discord.

I have a function like this, which returns more than one value:

import skrub

test = skrub.var("test", [1, 2])

@skrub.deferred
def process_test_data(test):
    left = test[0]
    right = test[1]
    return left, right

I cannot unpack the result directly because an exception is raised:

left, right  = test.skb.apply_func(process_test_data)

gives

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 2
      1 # %%
----> 2 left, right  = test.skb.apply_func(process_test_data)

File ~/Projects/work/skrub/skrub/_data_ops/_data_ops.py:593, in DataOp.__iter__(self)
    592 def __iter__(self):
--> 593     raise TypeError(
    594         "This object is a DataOp that will be evaluated later, "
    595         "when your learner runs. So it is not possible to eagerly "
    596         "iterate over it now."
    597     )

TypeError: This object is a DataOp that will be evaluated later, when your learner runs. So it is not possible to eagerly iterate over it now.

Instead, I have to assign the result to a different variable, then unpack that:

res = test.skb.apply_func(process_test_data)
left = res[0]
right = res[1]

combine = left + right
combine
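
The manual workaround can be wrapped in a small helper. This is only a sketch: `unpack` is a hypothetical function, not part of skrub. It relies solely on indexing, which the snippet above shows a DataOp supports, and the caller has to state the number of outputs because that cannot be discovered before evaluation. It is demonstrated on a plain tuple so that it runs without skrub.

```python
def unpack(indexable, n_outputs):
    """Return a tuple of the first n_outputs items, fetched by index only.

    Hypothetical helper, not part of skrub: it never iterates over
    `indexable`, so it also works on lazy objects that only support
    `obj[i]`.
    """
    return tuple(indexable[i] for i in range(n_outputs))


# Demonstrated on a plain tuple so the example runs without skrub.
left, right = unpack((1, 2), 2)
combine = left + right
print(combine)  # 3

# With skrub, the same call would presumably read:
# left, right = unpack(test.skb.apply_func(process_test_data), 2)
```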

How should we handle this use of deferred functions?

  • We could leave the functionality as is, explaining in the user guide how to unpack the returned tuple
  • We could modify deferred so that returning more than one value wraps each value into a DataOp, so that the resulting tuple can be unpacked directly.

I don't remember whether deferred functions are intended to return only one value, or whether we have simply never prepared examples that return more than one value.

rcap107 avatar Aug 24 '25 15:08 rcap107

Expanding on this after discussing with other devs.

Unpacking means iterating and assigning values. Iterating over a DataOp is not possible, because a DataOp cannot know what it is iterating over until it is evaluated. The same problem would occur with something like n, p = skrub.X().shape or a, b = skrub.var('t', (1, 2)).
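
To illustrate why indexing can be deferred while unpacking cannot, here is a toy model. The `Lazy` class is an assumption made for this example and is not skrub's actual DataOp implementation; it only mimics the relevant behavior.

```python
class Lazy:
    """Toy stand-in for a deferred value (not skrub's actual DataOp)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable, run only on eval()

    def __getitem__(self, index):
        # Indexing can be deferred: record "take item `index`" as a new step.
        return Lazy(lambda: self._compute()[index])

    def __iter__(self):
        # Unpacking needs the number of items now, but it is unknown
        # until eval(). Without this method Python would fall back to
        # calling __getitem__ with 0, 1, 2, ... and never hit IndexError.
        raise TypeError("cannot iterate over a value that is not computed yet")

    def eval(self):
        return self._compute()


pair = Lazy(lambda: (1, 2))

try:
    left, right = pair  # calls iter(pair), which raises
except TypeError as error:
    print(error)

left, right = pair[0], pair[1]  # indexing stays lazy
print(left.eval() + right.eval())  # 3
```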

The gist of it is that this is a "wontfix" issue because the problem does not have easy solutions. I will add to the user guide a note on this specific circumstance so that people are aware of the workaround.

rcap107 avatar Aug 25 '25 12:08 rcap107

> The gist of it is that this is a "wontfix" issue because the problem does not have easy solutions. I will add to the user guide a note on this specific circumstance so that people are aware of the workaround.

Can we have a good error message?

GaelVaroquaux avatar Aug 25 '25 12:08 GaelVaroquaux

> > The gist of it is that this is a "wontfix" issue because the problem does not have easy solutions. I will add to the user guide a note on this specific circumstance so that people are aware of the workaround.
>
> Can we have a good error message?

Yes, absolutely

rcap107 avatar Aug 25 '25 13:08 rcap107

Maybe it could be reworded a bit, but we already have a dedicated error message for this:

>>> import skrub
>>> iter(skrub.var('a'))
Traceback (most recent call last):
    ...
TypeError: This object is a DataOp that will be evaluated later, when your learner runs. So it is not possible to eagerly iterate over it now.

jeromedockes avatar Dec 06 '25 18:12 jeromedockes

This issue can be closed once we add a small paragraph to the control_flow.rst file in the documentation that explains the issue a bit. There should be:

  • a brief explanation of the situation
  • how to address it
  • the snippet of code I wrote above as an example

This should give the users an idea of how to deal with this situation.

rcap107 avatar Dec 15 '25 15:12 rcap107