feat: Add `narwhals.struct` top level function
What type of PR is this? (check all applicable)
- [ ] 💾 Refactor
- [x] ✨ Feature
- [ ] 🐛 Bug Fix
- [ ] 🔧 Optimization
- [ ] 📝 Documentation
- [ ] ✅ Test
- [ ] 🐳 Other
Related issues
- Related issue #3247
- Closes #3247
Checklist
- [ ] Code follows style guide (ruff)
- [x] Tests added
- [x] Documented the changes
If you have comments or can explain your changes, please do so below
TODO:
- [x] pandas like
- [x] doctest
- [x] docstring
- [x] code
- [x] polars
- [x] doctest
- [x] docstring
- [x] code
- [x] Arrow
- [x] doctest
- [x] docstring
- [x] code
So far this is what this PR does, I'll attempt polars/arrow next:
import narwhals as nw
import pandas as pd

df_native_pd = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pd = nw.from_native(df_native_pd)
df_struct_pd = df_pd.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌─────────────────────────────────┐
| Narwhals DataFrame |
|---------------------------------|
| t|
|0 {'a': 1, 'b': 'x', 'c': True}|
|1 {'a': 2, 'b': 'y', 'c': False}|
|2 {'a': 3, 'b': 'z', 'c': True}|
└─────────────────────────────────┘
What I have not yet figured out is where to place the imports, nor where to add unit tests apart from the doctests.
At this point, we also get these results for polars df and arrow tables:
Polars:
import narwhals as nw
import polars as pl

df_native_pl = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pl = nw.from_native(df_native_pl)
df_struct_pl = df_pl.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| shape: (3, 1) |
| ┌───────────┐ |
| │ t │ |
| │ --- │ |
| │ struct[2] │ |
| ╞═══════════╡ |
| │ {1,"x"} │ |
| │ {2,"y"} │ |
| │ {3,"z"} │ |
| └───────────┘ |
└──────────────────┘
Arrow:
import narwhals as nw
import pyarrow as pa

table_native_pa = pa.table({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pa = nw.from_native(table_native_pa)
df_struct_pa = df_pa.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌──────────────────────────────┐
| Narwhals DataFrame |
|------------------------------|
|pyarrow.Table |
|t: struct<a: int64, b: string>|
| child 0, a: int64 |
| child 1, b: string |
|---- |
|t: [ |
| -- is_valid: all not null |
| -- child 0 type: int64 |
|[1,2,3] |
| -- child 1 type: string |
|["x","y","z"]] |
└──────────────────────────────┘
@msalvany I think some wires may have been crossed 😅
This feature is narwhals.struct, which gets the name from polars:
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.struct.html
> @msalvany I think some wires may have been crossed 😅
Hi @dangotbanned. I see that the original issue is narwhals.struct. But in the Discord conversation with @MarcoGorelli we talked about concat_{str, list} (even though concat_list is not there yet). I thought that, in the same manner, concat_tuple would work, wouldn't it? That's why I went for concat_struct. But whatever people find more consistent works for me.
I have started with the tests. I see that there are more backends than pandas, polars and arrow.
(narwhals) ➜ narwhals git:(issue_3247) ✗ pytest tests/expr_and_series/concat_struct_test.py -v -k dryrun --tb=no
============================================================ test session starts ============================================================
platform darwin -- Python 3.12.12, pytest-8.4.2, pluggy-1.6.0 -- /Users/maria/Documents/OpenSource/Narwhals/narwhals/.venv/bin/python3
cachedir: .pytest_cache
Using --randomly-seed=1430920357
hypothesis profile 'default'
rootdir: /Users/maria/Documents/OpenSource/Narwhals/narwhals
configfile: pyproject.toml
plugins: xdist-3.8.0, randomly-4.0.1, hypothesis-6.142.4, env-1.2.0, cov-7.0.0
collected 7 items
tests/expr_and_series/concat_struct_test.py::test_dryrun[pandas] PASSED [ 14%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[sqlframe] FAILED [ 28%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[pyarrow] PASSED [ 42%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[pandas[pyarrow]] PASSED [ 57%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[polars[eager]] PASSED [ 71%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[ibis] FAILED [ 85%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[duckdb] FAILED [100%]
========================================================== short test summary info ==========================================================
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[sqlframe] - AttributeError: 'SparkLikeNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[ibis] - AttributeError: 'IbisNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[duckdb] - AttributeError: 'DuckDBNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
======================================================== 3 failed, 4 passed in 0.53s ========================================================
Should we also implement the missing ones?
Hey @msalvany - thanks for the contribution 🚀
As a little side note, and to expand a bit more on Dan's comment: we try to mirror the polars API, therefore we aim to have narwhals.struct, as mentioned in the original issue, behaving the same as the polars.struct function for all the backends.
In a similar way, narwhals.concat_list will mirror polars.concat_list.
However:
> I thought that in the same manner, `concat_tuple` would work
concat_tuple is not a polars function, therefore we won't have it either. There are a few exceptions to this rule, but this is not one of them.
Regarding other backends:
> I have started with the tests. I see that there are more backends than pandas, polars and arrow.
For now you can start by xfailing them in the tests. I can see you are already xfailing a certain polars version, so you can do something along the following lines:
def test_dryrun(constructor: Constructor, *, request: pytest.FixtureRequest) -> None:
if "polars" in str(constructor) and POLARS_VERSION < (1, 0, 0):
# nth only available after 1.0
request.applymarker(pytest.mark.xfail)
+ if any(x in str(constructor) for x in ("dask", "duckdb", "ibis", "pyspark", "sqlframe")):
+ reason = "Not supported/not implemented"
+ request.applymarker(pytest.mark.xfail(reason=reason))
and in those backend namespaces you can add `struct = not_implemented()` instead of defining the method.
I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼
Hi,
Thanks for the clarification @FBruzzesi, I totally get it now! I have changed all `concat_struct` references to `struct`.
Hey @msalvany first and foremost, thanks for updating the PR - it looks close to the finish line 🙏🏼
I have a few comments, especially regarding tests:
- In the test, you are running the function, but then it would be good to add a comparison with an expected output. Something along the lines of:

  result = ...
  expected = ...  # <- a dictionary that matches the result dataframe content as a key: list-of-values mapping
  assert_data_equal(result, expected)

- Locally, make sure to run `pytest narwhals --doctest-modules` as well. I think there is some formatting misalignment in the docstring example.
- I just noticed that the part on `pre-commit` in the contributing guide is not very clear. I would suggest running:

  uv pip install pre-commit
  pre-commit install
  pre-commit run --all-files

- I will update the PR title and convert it to draft - you are always free to change it back whenever you think it's ready.
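The suggested result/expected comparison can be illustrated with plain pandas standing in for the narwhals test helpers (`assert_data_equal` and the `constructor` fixture belong to the narwhals suite; here a plain dict comparison mimics them):

```python
import pandas as pd

# Build a result frame the way a backend test would...
result = pd.DataFrame({"a": [1, 2, 3]}).assign(b=lambda d: d["a"] * 2)

# ...then compare it against an expected {column: list of values} mapping.
expected = {"a": [1, 2, 3], "b": [2, 4, 6]}

# Stand-in for `assert_data_equal(result, expected)`.
assert result.to_dict("list") == expected
```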
thanks all! just a comment on
> I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼
we should at least verify that this operation is feasible for spark/duckdb. Fortunately, in this case, it looks like it's easily done with `struct_pack`, e.g.
In [35]: rel = duckdb.sql("select * from values (1,4,0),(1,5,1),(2,6,2) df(a,b,i)")
In [36]: rel
Out[36]:
┌───────┬───────┬───────┐
│ a │ b │ i │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│ 1 │ 4 │ 0 │
│ 1 │ 5 │ 1 │
│ 2 │ 6 │ 2 │
└───────┴───────┴───────┘
In [37]: rel.select('a', 'b', 'i', duckdb.FunctionExpression('struct_pack', 'a', 'b'))
Out[37]:
┌───────┬───────┬───────┬──────────────────────────────┐
│ a │ b │ i │ struct_pack(a, b) │
│ int32 │ int32 │ int32 │ struct(a integer, b integer) │
├───────┼───────┼───────┼──────────────────────────────┤
│ 1 │ 4 │ 0 │ {'a': 1, 'b': 4} │
│ 1 │ 5 │ 1 │ {'a': 1, 'b': 5} │
│ 2 │ 6 │ 2 │ {'a': 2, 'b': 6} │
└───────┴───────┴───────┴──────────────────────────────┘
in pyspark it looks like it's just struct
Hello @MarcoGorelli, I'm going to use your example here to ask if the output we expect after nw.struct() is a new column containing the struct inside the original dataframe (as you showed here), or rather a new independent df with a single column containing the struct.
If I understand this right, what polars.struct() generates is the 2nd option, but I might be mistaken.
So far, this is what I was mimicking, just let me know if it should be changed. Thanks!
> a new column containing the struct inside the original dataframe (as you showed here), or rather a new independent df with a single column containing the struct.
this depends on whether you use with_columns or select
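A quick pandas sketch of that distinction, with plain dicts standing in for a real struct dtype (in narwhals/polars this would be `select` vs `with_columns` with a proper Struct column):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Dicts as a stand-in for a struct column.
struct_col = df[["a", "b"]].to_dict("records")

# select-style: a new frame containing only the struct column.
only_struct = pd.DataFrame({"t": struct_col})

# with_columns-style: original columns with the struct column appended.
appended = df.assign(t=struct_col)
```

So neither behavior is baked into the `struct` expression itself; the expression just builds the column, and the verb it is passed to decides whether the original columns are kept.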
> in pyspark it looks like it's just struct
I simply tested `struct` from pyspark to be sure we get the same result, and it looks fine too:
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

data = [(1, 4, 0), (1, 5, 1), (2, 6, 2)]
columns = ["a", "b", "i"]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, columns)
df_with_struct = df.select("a", "b", "i", struct("a", "b").alias("struct_col"))
df_with_struct.show(truncate=False)
+---+---+---+----------+
|a |b |i |struct_col|
+---+---+---+----------+
|1 |4 |0 |{1, 4} |
|1 |5 |1 |{1, 5} |
|2 |6 |2 |{2, 6} |
+---+---+---+----------+