[Enh]: Implement `narwhals.struct`
**We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?**

Currently Narwhals can read struct fields, but can't create struct columns.

**Please describe the purpose of the new feature or describe the problem to solve.**

I need to create struct fields with Narwhals for a feature I'm working on.

**Suggest a solution if possible.**

No response

**If you have tried alternatives, please describe them below.**

No response

**Additional information that may help us understand your needs.**

No response
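To make the gap concrete, here is a minimal sketch (hedged: the field-reading call assumes the Polars-style `struct` namespace that Narwhals exposes on expressions; the constructor in the final comment is the feature being requested and does not exist yet):

```python
import narwhals as nw
import polars as pl

data = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
df = nw.from_native(pl.DataFrame(data).select(pl.struct("a", "b").alias("s")))

# Reading a field out of an existing struct column already works:
df.select(nw.col("s").struct.field("a"))

# ...but going the other way is not possible yet, i.e. something like
# nw.from_native(pl.DataFrame(data)).select(nw.struct("a", "b"))  # proposed, does not exist
```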
Thanks for the request, sounds good
Note that for pandas this would require pyarrow to be installed
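For illustration, a minimal sketch of why pyarrow is needed: pandas has no native struct dtype, so a struct column would have to be backed by `pd.ArrowDtype` wrapping a `pa.struct` type (example data is made up):

```python
import pandas as pd
import pyarrow as pa

# pandas has no native struct dtype; struct columns are pyarrow-backed.
struct_type = pa.struct([("a", pa.int64()), ("b", pa.string())])
s = pd.Series(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    dtype=pd.ArrowDtype(struct_type),  # requires pyarrow to be installed
)
print(s.dtype)  # struct<a: int64, b: string>[pyarrow]
```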
Hi!

I'd like to take this issue and want to double-check my understanding of the desired behavior for `nw.struct(...)`.

I've been exploring different Pandas representations of the "struct-like" output that `nw.struct(...)` should produce:
import pandas as pd
import pyarrow as pa
import polars as pl
from dataclasses import dataclass, make_dataclass

data = {
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
}
df = pd.DataFrame(data)
print("direct, would create expanded dataframe:\n", df)

class MyClass:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __repr__(self):
        return f"(a={self.a}, b={self.b})"

candidate1 = pd.DataFrame([{"struct_column": MyClass(**row)} for _, row in df.iterrows()])
print("list of structs, (is this the expected output?):\n", candidate1)

# MyDataClass = make_dataclass("MyClass", [("a", int), ("b", str)])
@dataclass
class MyDataClass:
    a: int
    b: str

candidate2 = pd.DataFrame([MyDataClass(**row) for _, row in df.iterrows()])
print("list of dataclass-structs, the dataframe gets expanded:\n", candidate2)

is_this_correct = pd.DataFrame([{"struct_column": MyDataClass(**row)} for _, row in df.iterrows()])
print("wrapping the struct into a dict element (is this the expected output?):\n", is_this_correct)
Output:
direct, would create expanded dataframe:
a b
0 1 x
1 2 y
2 3 z
list of structs, (is this the expected output?):
struct_column
0 (a=1, b=x)
1 (a=2, b=y)
2 (a=3, b=z)
list of dataclass-structs, the dataframe gets expanded:
a b
0 1 x
1 2 y
2 3 z
wrapping the struct into a dict element (is this the expected output?):
struct_column
0 MyDataClass(a=1, b='x')
1 MyDataClass(a=2, b='y')
2 MyDataClass(a=3, b='z')
Does this match the expected input/output before I start implementing?
If so, I'll proceed by following the `concat_str` pattern, but without the extra keyword args (`separator`, `ignore_nulls`, etc.).
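To make the analogy concrete, a rough sketch of the intended call shape; the `nw.struct` line is the proposal under discussion, not an existing API, and `concat_str`'s keyword args are as listed above:

```python
import narwhals as nw

# existing pattern being followed:
nw.concat_str(nw.col("a"), nw.col("b"), separator=" ", ignore_nulls=True)

# proposed analogue, same positional-expression shape but no extra keyword args:
# nw.struct(nw.col("a"), nw.col("b"))
```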
Thanks!
In what context are you using Pandas here? I personally view Pandas as a legacy project. It should not be used as a reference.
What exactly are you asking about here? Sorry, I don't quite understand your question :) Is it about using non-trivial objects like dataclasses to create structs? If so, I don't think that's the most important use case for structs.
Starting from dictionaries would be enough, and we could add support to other types in future PRs.
Imo, if you want to explore patterns and behaviors with structs, you should be looking at Polars instead. Polars can also handle Pydantic dataclasses, by the way.
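For reference, a minimal Polars sketch of the behavior being described here: `pl.struct` combines existing columns into a single struct column whose rows read back as plain dicts:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
out = df.select(pl.struct("a", "b").alias("my_struct"))

print(out.schema)                  # my_struct: Struct({'a': Int64, 'b': String})
print(out["my_struct"].to_list())  # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}, {'a': 3, 'b': 'z'}]
```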
Hi @danielgafni, thanks for the clarification!
Just to explain where I was coming from: since I'm still new to struct types, I started with Pandas only to better understand the expected input/output shape of the function. I wanted to make sure I interpreted the goal of `nw.struct(...)` correctly, not to rely on Pandas as the reference implementation.
I'm more familiar with Pandas at the moment, so it helped me reason things through, but I'll switch over to Polars now and try replicating the same logic there.
thanks for the question! we'll need to support this in pandas, so you're absolutely right to start looking there
if we make a pandas dataframe with pyarrow dtypes:
In [20]: df = pd.DataFrame({'a': [1,1,2,3], 'b': [4,5,6,7]}).convert_dtypes(dtype_backend='pyarrow')
In [21]: df
Out[21]:
a b
0 1 4
1 1 5
2 2 6
3 3 7
then we need to be able to make a new column which has `pd.ArrowDtype(pa.struct)` type. We can do that by:

- getting the arrow arrays corresponding to each series we want to make a struct out of
- passing them to `pyarrow.compute.make_struct`
- calling `to_pandas`
e.g.:
In [20]: df = pd.DataFrame({'a': [1,1,2,3], 'b': [4,5,6,7]}).convert_dtypes(dtype_backend='pyarrow')
In [21]: df
Out[21]:
a b
0 1 4
1 1 5
2 2 6
3 3 7
In [22]: import pyarrow.compute as pc
In [23]: s = pc.make_struct(df['a'].array._pa_array, df['b'].array._pa_array).to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))
In [24]: s
Out[24]:
0 {'0': 1, '1': 4}
1 {'0': 1, '1': 5}
2 {'0': 2, '1': 6}
3 {'0': 3, '1': 7}
dtype: struct<0: int64, 1: int64>[pyarrow]
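One small follow-up (a sketch building on the session above): `make_struct` also accepts a `field_names` option, so the original column names can be kept instead of the default `'0'`, `'1'`:

```python
import pandas as pd
import pyarrow.compute as pc

df = pd.DataFrame({"a": [1, 1, 2, 3], "b": [4, 5, 6, 7]}).convert_dtypes(dtype_backend="pyarrow")

# field_names keeps the original column names in the resulting struct dtype
s = pc.make_struct(
    df["a"].array._pa_array,
    df["b"].array._pa_array,
    field_names=["a", "b"],
).to_pandas(types_mapper=pd.ArrowDtype)

print(s.dtype)  # struct<a: int64, b: int64>[pyarrow]
```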
Hi! Here's an update on `nw.struct(...)`, e.g. `nw.struct("a", "b", "c")`.

**Top-level constructor**

- [x] I've implemented the `struct(...)` function, to live in `narwhals/functions.py`:
# struct() constructor function
# to be saved in narwhals/functions.py
def struct(
    exprs: IntoExpr | Iterable[IntoExpr],
    *more_exprs: IntoExpr,
) -> Expr:
    """
    Horizontally combine multiple columns into a single struct column.

    Arguments:
        exprs: One or more expressions to combine into a struct. Strings are treated as column names.
        *more_exprs: Additional columns or expressions, passed as positional arguments.

    Returns:
        An expression that produces a single struct column containing the given fields.

    Example:
        >>> import pandas as pd
        >>> import narwhals as nw
        >>>
        >>> data = {
        ...     "a": [1, 2, 3],
        ...     "b": ["dogs", "cats", None],
        ...     "c": ["play", "swim", "walk"],
        ... }
        >>> df_native = pd.DataFrame(data)
        >>> (
        ...     nw.from_native(df_native).select(
        ...         nw.struct("a", "b").alias("my_struct")
        ...     )
        ... )
        ┌──────────────────────────┐
        |    Narwhals DataFrame    |
        |--------------------------|
        |                my_struct |
        |0   {'a': 1, 'b': 'dogs'} |
        |1   {'a': 2, 'b': 'cats'} |
        |2     {'a': 3, 'b': None} |
        └──────────────────────────┘
    """
    flat_exprs = flatten([*flatten([exprs]), *more_exprs])
    parsed_exprs = [parse_expr(e) for e in flat_exprs]
    # Build a symbolic ExprNode representing "make a struct"
    node = ExprNode(kind="struct", exprs=parsed_exprs)
    # Return an Expr object wrapping the node
    return Expr([node])
**Backend logic (tested separately)**

I've prototyped and tested backend implementations for:

- [x] Pandas (using `convert_dtypes(dtype_backend="pyarrow")` + `pyarrow.compute.make_struct`)
import pandas as pd
import pyarrow.compute as pc

# Backend implementation of nw.struct(...) in Pandas
def make_struct_df_pd(df: pd.DataFrame, columns: list[str], struct_col_name: str = "struct") -> pd.DataFrame:
    # Convert the selected columns to pyarrow-backed dtypes
    df_arrow = df[columns].convert_dtypes(dtype_backend="pyarrow")
    # Pull out the underlying Arrow (chunked) arrays
    arrays = [df_arrow[col].array._pa_array for col in columns]
    # Build a struct array, using the column names as field names
    struct_array = pc.make_struct(*arrays, field_names=columns)
    # Convert back to a pandas Series with an Arrow struct dtype
    struct_series = struct_array.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))
    # Return a new DataFrame with just the struct column
    return pd.DataFrame({struct_col_name: struct_series})

# Example
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
struct_df = make_struct_df_pd(df, columns=["a", "b", "c"], struct_col_name="my_struct")
my_struct
0 {'a': 1, 'b': 'x', 'c': True}
1 {'a': 2, 'b': 'y', 'c': False}
2 {'a': 3, 'b': 'z', 'c': True}
my_struct struct<a: int64, b: string, c: bool>[pyarrow]
dtype: object
- [x] Polars
import polars as pl

# Backend implementation of nw.struct(...) in Polars
def make_struct_df_pl(df: pl.DataFrame, columns: list[str], struct_col_name: str = "struct") -> pl.DataFrame:
    # pl.struct handles the combination natively
    return df.select(
        pl.struct(columns).alias(struct_col_name)
    )

# Example
df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
struct_df = make_struct_df_pl(df, columns=["a", "b", "c"], struct_col_name="my_struct")
shape: (3, 1)
┌───────────────┐
│ my_struct     │
│ ---           │
│ struct[3]     │
╞═══════════════╡
│ {1,"x",true}  │
│ {2,"y",false} │
│ {3,"z",true}  │
└───────────────┘
[Struct({'a': Int64, 'b': String, 'c': Boolean})]
- [x] Arrow
import pyarrow as pa
import pyarrow.compute as pc

# Backend implementation of nw.struct(...) in Arrow
def make_struct_table_pa(table: pa.Table, columns: list[str], struct_col_name: str = "struct") -> pa.Table:
    # Combine each (possibly chunked) column into a single Arrow array
    arrays = [table[column].combine_chunks() for column in columns]
    # Unpack the arrays into make_struct, keeping the column names as field names
    struct_array = pc.make_struct(*arrays, field_names=columns)
    # Return a new Table with only the struct column
    return pa.table({struct_col_name: struct_array})

# Example
table = pa.table({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
struct_table = make_struct_table_pa(table, columns=["a", "b"], struct_col_name="my_struct")
{'my_struct': [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}, {'a': 3, 'b': 'z'}]}
pyarrow.Table
my_struct: struct<a: int64, b: string>
child 0, a: int64
child 1, b: string
----
my_struct: [
-- is_valid: all not null
-- child 0 type: int64
[1,2,3]
-- child 1 type: string
["x","y","z"]]
Let me know if this direction looks good before I move on to integrating the backend implementations into the respective namespace classes (PandasLikeExpr, PolarsExpr, ArrowExpr). Happy to adjust if needed!
Thanks again :)
this looks like a start, yes. there are some details we'll need to take care of, but i'd suggest trying to piece everything together