
[Enh]: Implement `narwhals.struct`

Open danielgafni opened this issue 2 months ago • 7 comments

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

Currently Narwhals can read struct fields, but can't create struct columns.

Please describe the purpose of the new feature or describe the problem to solve.

I need to create struct fields with Narwhals for a feature I'm working on.

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

danielgafni avatar Oct 28 '25 00:10 danielgafni

Thanks for the request, sounds good

Note that for pandas this would require pyarrow to be installed

MarcoGorelli avatar Oct 28 '25 07:10 MarcoGorelli

Hi 👋

I'd like to take this issue and want to double-check my understanding of the desired behavior for nw.struct(...).

I've been exploring different "struct-like" representations in Pandas to work out what nw.struct(...) should produce:

import pandas as pd
import pyarrow as pa
import polars as pl
from dataclasses import dataclass, make_dataclass


data = {
    "a": [1, 2, 3],
    "b": ["x", "y", "z"]
}

df = pd.DataFrame( data )
print("direct, would create expanded dataframe:\n", df)

class MyClass:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __repr__(self):
        return f"(a={self.a}, b={self.b})"

candidate1 = pd.DataFrame( [ {"struct_column": MyClass(**row)} for _, row in df.iterrows() ] )
print("list of structs, (is this the expected output?):\n", candidate1)

# MyDataClass = make_dataclass("MyClass", [("a", int), ("b", str)])
@dataclass
class MyDataClass:
    a: int
    b: str

candidate2 = pd.DataFrame( [ MyDataClass(**row) for _, row in df.iterrows() ] )
print("list of dataclass-structs, the dataframe gets expanded:\n", candidate2)

is_this_correct = pd.DataFrame( [ {"struct_column": MyDataClass(**row)} for _, row in df.iterrows() ] )
print("wrapping the struct into a dict element (is this the expected output?):\n", is_this_correct)

Output:

direct, would create expanded dataframe:
    a  b
0  1  x
1  2  y
2  3  z
list of structs, (is this the expected output?):
   struct_column
0    (a=1, b=x)
1    (a=2, b=y)
2    (a=3, b=z)
list of dataclass-structs, the dataframe gets expanded:
    a  b
0  1  x
1  2  y
2  3  z
wrapping the struct into a dict element (is this the expected output?):
              struct_column
0  MyDataClass(a=1, b='x')
1  MyDataClass(a=2, b='y')
2  MyDataClass(a=3, b='z')

Does this match the expected input/output before I start implementing? If so, I'll proceed by following the concat_str pattern, but without the extra keyword args (separator, ignore_nulls, etc.).

Thanks!

msalvany avatar Oct 29 '25 20:10 msalvany

In what context are you using Pandas here? I personally view Pandas as a legacy project. It should not be used as reference.

What exactly are you asking about here? Sorry, I don't quite understand your question :) Is it about using non-trivial objects like dataclasses to create structs? If so, I don't think that's the most important use case for structs.

Starting from dictionaries would be enough, and we could add support to other types in future PRs.

Imo, if you want to explore patterns and behaviors with structs, you should be looking at Polars instead. Polars can also handle Pydantic dataclasses, by the way.

danielgafni avatar Oct 30 '25 01:10 danielgafni

Hi @danielgafni, thanks for the clarification!

Just to explain where I was coming from: since I'm still new to struct types, I started with Pandas only to better understand the expected input/output shape of the function. I wanted to make sure I interpreted the goal of nw.struct(...) correctly, not to rely on Pandas as the reference implementation.

I'm more familiar with Pandas at the moment, so it helped me reason things through, but I'll switch over to Polars now and try replicating the same logic there.

msalvany avatar Oct 30 '25 08:10 msalvany

thanks for the question! we'll need to support this in pandas, so you're absolutely right to start looking there

if we make a pandas dataframe with pyarrow dtypes:

In [20]: df = pd.DataFrame({'a': [1,1,2,3], 'b': [4,5,6,7]}).convert_dtypes(dtype_backend='pyarrow')

In [21]: df
Out[21]:
   a  b
0  1  4
1  1  5
2  2  6
3  3  7

then we need to be able to make a new column which has pd.ArrowDtype(pa.struct) type. We can do that by:

  • getting the arrow arrays corresponding to each series we want to make a struct out of
  • passing them to pyarrow.compute.make_struct
  • calling to_pandas

e.g.:

In [20]: df = pd.DataFrame({'a': [1,1,2,3], 'b': [4,5,6,7]}).convert_dtypes(dtype_backend='pyarrow')

In [21]: df
Out[21]:
   a  b
0  1  4
1  1  5
2  2  6
3  3  7

In [23]: s = pc.make_struct(df['a'].array._pa_array, df['b'].array._pa_array).to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))

In [24]: s
Out[24]:
0    {'0': 1, '1': 4}
1    {'0': 1, '1': 5}
2    {'0': 2, '1': 6}
3    {'0': 3, '1': 7}
dtype: struct<0: int64, 1: int64>[pyarrow]

MarcoGorelli avatar Oct 30 '25 09:10 MarcoGorelli

Hi! Here's an update on nw.struct(...)

nw.struct("a", "b", "c")

Top-level constructor

  • [x] I've implemented the struct(...) function, to be placed in functions.py:
# struct() constructor function
# to be saved in narwhals/functions.py

def struct(
    exprs: IntoExpr | Iterable[IntoExpr],
    *more_exprs: IntoExpr,
) -> Expr:
    """
    Horizontally combine multiple columns into a single struct column.

    Arguments:
        exprs: One or more expressions to combine into a struct. Strings are treated as column names.
        *more_exprs: Additional columns or expressions, passed as positional arguments.

    Returns:
        An expression that produces a single struct column containing the given fields.

    Example:
        >>> import pandas as pd
        >>> import narwhals as nw
        >>>
        >>> data = {
        ...     "a": [1, 2, 3],
        ...     "b": ["dogs", "cats", None],
        ...     "c": ["play", "swim", "walk"],
        ... }
        >>> df_native = pd.DataFrame(data)
        >>> (
        ...     nw.from_native(df_native).select(
        ...         nw.struct("a", "b").alias("my_struct")
        ...     )
        ... )
        ┌──────────────────────────┐
        |     Narwhals DataFrame   |
        |--------------------------|
        |     my_struct            |
        | 0  {'a': 1, 'b': 'dogs'} |
        | 1  {'a': 2, 'b': 'cats'} |
        | 2  {'a': 3, 'b': None}   |
        └──────────────────────────┘
    """
    flat_exprs = flatten([*flatten([exprs]), *more_exprs])
    parsed_exprs = [parse_expr(e) for e in flat_exprs]
    node = ExprNode(kind="struct", exprs=parsed_exprs) # Build a symbolic ExprNode representing "make a struct"
    return Expr([node]) # Return an Expr object wrapping the node
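
The argument handling in the first line of the body is what lets both `nw.struct("a", "b")` and `nw.struct(["a", "b"])` work. The sketch below shows that idea in isolation with plain Python; `flatten_args` is a hypothetical stand-in, not the `flatten` helper used in Narwhals, and it only handles one level of nesting.

```python
from collections.abc import Iterable
from typing import Any


def flatten_args(first: Any, *more: Any) -> list[Any]:
    """Hypothetical helper: normalise f("a", "b") and f(["a", "b"]) to one list."""
    # Strings are iterable, but here a string stands for a single column name,
    # so it must not be split into characters.
    if isinstance(first, Iterable) and not isinstance(first, (str, bytes)):
        return [*first, *more]
    return [first, *more]


print(flatten_args("a", "b"))
print(flatten_args(["a", "b"], "c"))
```

Treating strings as atoms is the important detail: without that check, a column name like "ab" would be expanded into "a" and "b".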

Backend logic (tested separately): I've prototyped and tested backend implementations for:

  • [x] Pandas

(using convert_dtypes(dtype_backend="pyarrow") + pyarrow.compute.make_struct)

# Backend implementation of nw.struct(...) in Pandas
import pandas as pd
import pyarrow.compute as pc

def make_struct_df_pd(df: pd.DataFrame, columns: list[str], struct_col_name: str = "struct") -> pd.DataFrame:
    df_arrow = df[columns].convert_dtypes(dtype_backend="pyarrow") # Convert selected columns to Arrow dtype
    arrays = [df_arrow[col].array._pa_array for col in columns] # Create Arrow arrays from the selected columns
    struct_array = pc.make_struct(*arrays, field_names=columns) # Create struct array containing columns names as field names
    struct_series = struct_array.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x)) # Convert array to pandas Series
    return pd.DataFrame({struct_col_name: struct_series}) # Return a new DataFrame with just the struct column

# Example
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})

struct_df = make_struct_df_pd(df, columns=["a", "b", "c"], struct_col_name="my_struct")
                        my_struct
0   {'a': 1, 'b': 'x', 'c': True}
1  {'a': 2, 'b': 'y', 'c': False}
2   {'a': 3, 'b': 'z', 'c': True}
my_struct    struct<a: int64, b: string, c: bool>[pyarrow]
dtype: object
  • [x] Polars
# Backend implementation of nw.struct(...) in Polars
import polars as pl

def make_struct_df_pl(df: pl.DataFrame, columns: list[str], struct_col_name: str = "struct") -> pl.DataFrame:
    return df.select(
        pl.struct(columns).alias(struct_col_name)
    )

# Example

df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})

struct_df = make_struct_df_pl(df, columns=["a", "b", "c"], struct_col_name="my_struct")
shape: (3, 1)
┌───────────────┐
│ my_struct     │
│ ---           │
│ struct[3]     │
╞═══════════════╡
│ {1,"x",true}  │
│ {2,"y",false} │
│ {3,"z",true}  │
└───────────────┘
[Struct({'a': Int64, 'b': String, 'c': Boolean})]
  • [x] Arrow
# Backend implementation of nw.struct(...) in Arrow
import pyarrow as pa
import pyarrow.compute as pc

def make_struct_table_pa(table: pa.Table, columns: list[str], struct_col_name: str = "struct") -> pa.Table:
    arrays = [table[column].combine_chunks() for column in columns] # Combine each column into a single Arrow Array
    struct_array = pc.make_struct(*arrays, field_names=columns) # Unpack the arrays into make_struct
    return pa.table({struct_col_name: struct_array}) # Return a new Table with only the struct column

# Example
table = pa.table({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})

struct_table = make_struct_table_pa(table, columns=["a", "b"], struct_col_name="my_struct")
{'my_struct': [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}, {'a': 3, 'b': 'z'}]}
pyarrow.Table
my_struct: struct<a: int64, b: string>
  child 0, a: int64
  child 1, b: string
----
my_struct: [
  -- is_valid: all not null
  -- child 0 type: int64
[1,2,3]
  -- child 1 type: string
["x","y","z"]]

Let me know if this direction looks good before I move on to integrating the backend implementations into the respective namespace classes (PandasLikeExpr, PolarsExpr, ArrowExpr). Happy to adjust if needed!

Thanks again :)

msalvany avatar Oct 30 '25 17:10 msalvany

this looks like a start, yes. there are some details we'll need to take care of, but i'd suggest trying to piece everything together

MarcoGorelli avatar Oct 30 '25 17:10 MarcoGorelli