
`struct(col("*"))` creates separate struct columns instead of single struct with all columns

Open universalmind303 opened this issue 9 months ago • 2 comments

Describe the bug

When struct(col("*")) is used in a select operation, it creates a separate struct column for each original column, each containing only one field, instead of a single struct column containing all columns as fields.

To Reproduce

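import daft
from daft import col

df = daft.from_pydict({
    "embeddings": [[1, 2, 3], [4, 5, 6, 7]],
    "text": ["hello world", "goodbye universe"]
})

# Wildcard inside struct(): produces one struct column per input column
df.select(daft.struct(col("*"))).collect()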
╭─────────────────────────────────┬──────────────────────────╮
│ struct                          ┆ struct                   │
│ ---                             ┆ ---                      │
│ Struct[embeddings: List[Int64]] ┆ Struct[text: Utf8]       │
╞═════════════════════════════════╪══════════════════════════╡
│ {embeddings: [1, 2, 3],         ┆ {text: hello world,      │
│ }                               ┆ }                        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {embeddings: [4, 5, 6, 7],      ┆ {text: goodbye universe, │
│ }                               ┆ }                        │
╰─────────────────────────────────┴──────────────────────────╯
(Showing first 2 of 2 rows)

Expected behavior

struct(col("*")) should produce a single column containing a struct with all original columns as fields:

╭─────────────────────────────────────────────────────────────╮
│ struct                                                      │
│ ---                                                         │
│ Struct[embeddings: List[Int64], text: Utf8]                │
╞═════════════════════════════════════════════════════════════╡
│ {embeddings: [1, 2, 3], text: hello world}                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {embeddings: [4, 5, 6, 7], text: goodbye universe}         │
╰─────────────────────────────────────────────────────────────╯

Component(s)

Expressions

Additional context

The issue appears to be that wildcard expansion distributes over the enclosing function call.

The struct function is applied once to each individual column that matches the wildcard, rather than once to the collection of all matching columns.
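
The difference between the two expansion strategies can be sketched with a toy model. This is not Daft's implementation: `Col`, `Struct`, `expand_per_column`, and `expand_in_place` are hypothetical names used only for illustration, and the model assumes a flat struct whose arguments are plain column references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str  # "*" stands for the wildcard (hypothetical toy type)


@dataclass(frozen=True)
class Struct:
    children: tuple  # tuple of Col (hypothetical toy type)


def expand_per_column(expr, schema):
    """Models the reported behavior: the whole expression is copied
    once per matching column, so struct() wraps each column alone."""
    if not any(c.name == "*" for c in expr.children):
        return [expr]
    return [
        Struct(tuple(Col(name) if c.name == "*" else c for c in expr.children))
        for name in schema
    ]


def expand_in_place(expr, schema):
    """Models the expected behavior: the wildcard is spliced into the
    argument list, so a single struct() receives every column."""
    children = []
    for c in expr.children:
        if c.name == "*":
            children.extend(Col(name) for name in schema)
        else:
            children.append(c)
    return [Struct(tuple(children))]


schema = ["embeddings", "text"]
expr = Struct((Col("*"),))

per_column = expand_per_column(expr, schema)  # two single-field structs
in_place = expand_in_place(expr, schema)      # one two-field struct
```

In this model, `per_column` corresponds to the two-column output shown under "To Reproduce", and `in_place` corresponds to the single column shown under "Expected behavior".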

universalmind303, Jul 08 '25 15:07

As a workaround, you can spread the columns out manually, which works as expected:

df.select(daft.struct(*df.columns)).collect()
╭─────────────────────────────────────────────╮
│ struct                                      │
│ ---                                         │
│ Struct[embeddings: List[Int64], text: Utf8] │
╞═════════════════════════════════════════════╡
│ {embeddings: [1, 2, 3],                     │
│ text:…                                      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {embeddings: [4, 5, 6, 7],                  │
│ te…                                         │
╰─────────────────────────────────────────────╯
(Showing first 2 of 2 rows)

universalmind303, Jul 08 '25 15:07

@kevinzwang @universalmind303 What are the next steps here?

rohitkulshreshtha, Jul 08 '25 23:07