bug: daft.method(unnest=True) returns the original with_column name
Describe the bug
Instead of naming the output columns after the struct's fields, the original with_column name is applied to every unnested column. This yields an ambiguous column reference error when selecting "result", or a column-not-found error when selecting "bar".
To Reproduce
import daft

@daft.cls()
class Foo:
    def __init__(self, x: str):
        self.x = x

    @daft.method(
        return_dtype=daft.DataType.struct({
            "bar": daft.DataType.string(),
            "some_int": daft.DataType.int64(),
        }),
        unnest=True
    )
    def do_something(self, input: str):
        return {
            "bar": input + self.x,
            "some_int": 3,
        }

foobar = Foo("bar")
df = daft.from_pydict({"input": ["daft is cool"]}).with_column("result", foobar.do_something(daft.col("input")))
df.show()
# Returns
╭──────────────┬─────────────────┬────────╮
│ input        ┆ result          ┆ result │
│ ---          ┆ ---             ┆ ---    │
│ String       ┆ String          ┆ Int64  │
╞══════════════╪═════════════════╪════════╡
│ daft is cool ┆ daft is coolbar ┆ 3      │
╰──────────────┴─────────────────┴────────╯
Expected behavior
Should return
╭──────────────┬─────────────────┬──────────╮
│ input        ┆ bar             ┆ some_int │
│ ---          ┆ ---             ┆ ---      │
│ String       ┆ String          ┆ Int64    │
╞══════════════╪═════════════════╪══════════╡
│ daft is cool ┆ daft is coolbar ┆ 3        │
╰──────────────┴─────────────────┴──────────╯
Component(s)
Other
@kevinzwang I know you initially implemented this, along with the new index-based schema resolution. Would you want to take a look at this? I've come across this bug before as well.
Yeah, this is an issue with how with_column interacts with unnest: with_column simply adds an alias to the expression, and that alias is then propagated to all of the child expressions. For now, I would suggest using df.select("*", foobar.do_something(..)) instead. The interaction of unnest with our projection operators is something I don't have a great story around right now, and we should think about it. Open to hearing what others think!
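To make the alias propagation described above concrete, here is a toy model in plain Python. It is only a sketch and not Daft's planner code: `unnest` and `alias_all` are hypothetical helpers that assume an unnested struct expands into one child per field, and that with_column's alias overwrites each child's name.

```python
# Toy model only: these helpers are hypothetical and do not exist in Daft.

def unnest(struct_fields):
    """Expand a struct into one (name, dtype) child per field."""
    return list(struct_fields.items())

def alias_all(children, alias):
    """Mimic the reported behavior: the alias replaces every child's name."""
    return [(alias, dtype) for _, dtype in children]

fields = {"bar": "String", "some_int": "Int64"}
children = unnest(fields)

# with_column("result", ...) aliases the expression, so every child is renamed.
with_column_names = [name for name, _ in alias_all(children, "result")]

# select("*", ...) adds no alias, so the field names survive.
select_names = [name for name, _ in children]

print(with_column_names)  # ['result', 'result']
print(select_names)       # ['bar', 'some_int']
```

Under those assumptions, the duplicate "result" columns in the report and the field-named columns from the select workaround both fall out of the same rule.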
I was wondering whether unnest could just use the struct column name as a prefix for the nested fields. That way you could unnest structured outputs multiple times in the same dataframe without column name collisions.
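A rough sketch of what that prefix-based naming could look like, in plain Python. The helper `prefixed_field_names` and the "." separator are assumptions made for illustration; nothing here reflects an existing or planned Daft API.

```python
# Hypothetical naming scheme: neither the helper nor the separator comes from Daft.

def prefixed_field_names(struct_name, field_names, sep="."):
    """Build output column names as <struct column name><sep><field name>."""
    return [f"{struct_name}{sep}{field}" for field in field_names]

# Two unnested outputs with the same struct fields in one dataframe.
first = prefixed_field_names("result", ["bar", "some_int"])
second = prefixed_field_names("result2", ["bar", "some_int"])

print(first)   # ['result.bar', 'result.some_int']
print(second)  # ['result2.bar', 'result2.some_int']

# The two sets of names do not overlap, so there is nothing ambiguous to resolve.
collisions = set(first) & set(second)
print(collisions)  # set()
```

The trade-off is that the plain field names ("bar", "some_int") from the expected output above would no longer be selectable directly, so this would change the behavior the report asks for rather than just fix it.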