Better handling of cases where a deferred function returns more than one value
Following up on a question on the skrub Discord.
I have a function like this, which returns more than one value:
import skrub

test = skrub.var("test", [1, 2])

@skrub.deferred
def process_test_data(test):
    left = test[0]
    right = test[1]
    return left, right
I cannot unpack the result directly because an exception is raised:
left, right = test.skb.apply_func(process_test_data)
gives
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 2
      1 # %%
----> 2 left, right = test.skb.apply_func(process_test_data)

File ~/Projects/work/skrub/skrub/_data_ops/_data_ops.py:593, in DataOp.__iter__(self)
    592 def __iter__(self):
--> 593     raise TypeError(
    594         "This object is a DataOp that will be evaluated later, "
    595         "when your learner runs. So it is not possible to eagerly "
    596         "iterate over it now."
    597     )

TypeError: This object is a DataOp that will be evaluated later, when your learner runs. So it is not possible to eagerly iterate over it now.
Instead, I have to assign the result to a different variable, then unpack that:
res = test.skb.apply_func(process_test_data)
left = res[0]
right = res[1]
combine = left + right
combine
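If this pattern comes up often, the workaround could be wrapped in a small helper. The sketch below is only an illustration: unpack is a hypothetical name, not part of skrub. It relies only on indexing, which the workaround above shows is supported on a DataOp, and it requires the caller to state the number of items up front, since that number cannot be discovered before evaluation. It is demonstrated here on a plain list so that it runs without skrub.

```python
def unpack(indexable, n):
    """Return the first n items of `indexable` as a tuple, using indexing only.

    Hypothetical helper: it never iterates over `indexable`, so it is assumed
    to work on any object that supports deferred indexing, such as `res` above.
    """
    return tuple(indexable[i] for i in range(n))


# Demonstration on a plain list (no skrub needed to run this).
left, right = unpack([1, 2], 2)
combine = left + right

# Assumed usage with a DataOp (not executed here):
#     left, right = unpack(test.skb.apply_func(process_test_data), 2)
```

The explicit n is the price of staying lazy: the helper builds exactly n indexing operations and leaves any length mismatch to surface when the result is evaluated.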
How should we handle this use of deferred functions?
- We could leave the functionality as is, explaining in the user guide how to unpack the returned tuple
- We could modify deferred so that returning more than one value wraps each value into a DataOp, so that the resulting tuple can be unpacked directly.
I don't remember if deferred functions are intended to return only one value, or if we simply have never prepared examples that return more than one value.
Expanding on this after discussing with other devs.
Unpacking means iterating and assigning values. Iterating is not possible on a DataOp because it cannot know what it is iterating over, or how many items there are, until it is evaluated. The same problem would occur with something like n, p = skrub.X().shape or a, b = skrub.var('t', (1, 2)).
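To make the reasoning concrete, here is a toy model of a deferred value. It is a minimal sketch and not skrub's implementation: the Lazy class and its evaluate method are invented for illustration. It shows why indexing can be deferred (it simply records one more step) while unpacking cannot (Python needs to know the number of items immediately).

```python
class Lazy:
    """Toy stand-in for a deferred value; this is NOT skrub's DataOp."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the value

    def __getitem__(self, key):
        # Indexing can be deferred: return a new Lazy without computing anything.
        return Lazy(lambda: self._compute()[key])

    def __iter__(self):
        # The number of items is unknown until evaluation, so refuse to iterate.
        # Without this method, Python would fall back to calling __getitem__
        # with 0, 1, 2, ... and would never receive an IndexError.
        raise TypeError("cannot eagerly iterate over a deferred value")

    def evaluate(self):
        return self._compute()


pair = Lazy(lambda: (1, 2))

# Tuple unpacking calls iter(pair), which fails right away.
try:
    left, right = pair
    unpack_error = None
except TypeError as exc:
    unpack_error = str(exc)

# Explicit indexing stays lazy and works.
left, right = pair[0], pair[1]
total = left.evaluate() + right.evaluate()
```

In this toy model, nothing is computed until evaluate is called, which mirrors the situation described above: the object has no way to answer "how many items?" at the time the unpacking statement runs.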
The gist of it is that this is a "wontfix" issue because the problem does not have easy solutions. I will add to the user guide a note on this specific circumstance so that people are aware of the workaround.
Can we have a good error message?
Yes, absolutely
Maybe it could be reworded a bit, but we already have a dedicated error message for this:
>>> import skrub
>>> iter(skrub.var('a'))
Traceback (most recent call last):
...
TypeError: This object is a DataOp that will be evaluated later, when your learner runs. So it is not possible to eagerly iterate over it now.
This issue can be closed once we add a small paragraph to the control_flow.rst file in the documentation that explains the issue a bit. There should be:
- a brief explanation of the situation
- how to address it
- the snippet of code I wrote above as an example
This should give the users an idea of how to deal with this situation.