
feat: Add `narwhals.struct` top level function

Open msalvany opened this issue 1 month ago • 12 comments

What type of PR is this? (check all applicable)

  • [ ] 💾 Refactor
  • [x] ✨ Feature
  • [ ] 🐛 Bug Fix
  • [ ] 🔧 Optimization
  • [ ] 📝 Documentation
  • [ ] ✅ Test
  • [ ] 🐳 Other

Related issues

  • Related issue #3247
  • Closes #3247

Checklist

  • [ ] Code follows style guide (ruff)
  • [x] Tests added
  • [x] Documented the changes

If you have comments or can explain your changes, please do so below

TODO:

  • [x] pandas like
    • [x] doctest
    • [x] docstring
    • [x] code
  • [x] polars
    • [x] doctest
    • [x] docstring
    • [x] code
  • [x] Arrow
    • [x] doctest
    • [x] docstring
    • [x] code

msalvany avatar Oct 31 '25 14:10 msalvany

So far this is what this PR does, I'll attempt polars/arrow next:

import pandas as pd
import narwhals as nw

df_native_pd = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pd = nw.from_native(df_native_pd)
df_struct_pd = df_pd.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌─────────────────────────────────┐
|       Narwhals DataFrame        |
|---------------------------------|
|                                t|
|0   {'a': 1, 'b': 'x', 'c': True}|
|1  {'a': 2, 'b': 'y', 'c': False}|
|2   {'a': 3, 'b': 'z', 'c': True}|
└─────────────────────────────────┘

What I have not yet figured out is where to place the imports, or where to add unit tests apart from the doctests.

msalvany avatar Oct 31 '25 14:10 msalvany

At this point, we also get these results for polars df and arrow tables:

Polars:

import polars as pl
import narwhals as nw

df_native_pl = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pl = nw.from_native(df_native_pl)
df_struct_pl = df_pl.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|  shape: (3, 1)   |
|  ┌───────────┐   |
|  │ t         │   |
|  │ ---       │   |
|  │ struct[2] │   |
|  ╞═══════════╡   |
|  │ {1,"x"}   │   |
|  │ {2,"y"}   │   |
|  │ {3,"z"}   │   |
|  └───────────┘   |
└──────────────────┘

Arrow:

import pyarrow as pa
import narwhals as nw

table_native_pa = pa.table({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pa = nw.from_native(table_native_pa)
df_struct_pa = df_pa.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))

┌──────────────────────────────┐
|      Narwhals DataFrame      |
|------------------------------|
|pyarrow.Table                 |
|t: struct<a: int64, b: string>|
|  child 0, a: int64           |
|  child 1, b: string          |
|----                          |
|t: [                          |
|  -- is_valid: all not null   |
|  -- child 0 type: int64      |
|[1,2,3]                       |
|  -- child 1 type: string     |
|["x","y","z"]]                |
└──────────────────────────────┘

msalvany avatar Oct 31 '25 16:10 msalvany

@msalvany I think some wires may have been crossed 😅

This feature is narwhals.struct, which gets the name from polars:

  • https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.struct.html

dangotbanned avatar Oct 31 '25 17:10 dangotbanned

@msalvany I think some wires may have been crossed 😅

Hi @dangotbanned. I see that the original issue is narwhals.struct. But in the Discord conversation with @MarcoGorelli we talked about concat_{str, list} (even though concat_list is not there yet). I thought that in the same manner, concat_tuple would work, would it not? That's why I went for concat_struct. But whatever people find more consistent works for me.

msalvany avatar Oct 31 '25 18:10 msalvany

I have started with the tests. I see that there are more backends than pandas, polars and arrow.

(narwhals) ➜  narwhals git:(issue_3247) ✗ pytest tests/expr_and_series/concat_struct_test.py -v -k dryrun --tb=no
============================================================ test session starts ============================================================
platform darwin -- Python 3.12.12, pytest-8.4.2, pluggy-1.6.0 -- /Users/maria/Documents/OpenSource/Narwhals/narwhals/.venv/bin/python3
cachedir: .pytest_cache
Using --randomly-seed=1430920357
hypothesis profile 'default'
rootdir: /Users/maria/Documents/OpenSource/Narwhals/narwhals
configfile: pyproject.toml
plugins: xdist-3.8.0, randomly-4.0.1, hypothesis-6.142.4, env-1.2.0, cov-7.0.0
collected 7 items                                                                                                                           

tests/expr_and_series/concat_struct_test.py::test_dryrun[pandas] PASSED                                                               [ 14%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[sqlframe] FAILED                                                             [ 28%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[pyarrow] PASSED                                                              [ 42%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[pandas[pyarrow]] PASSED                                                      [ 57%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[polars[eager]] PASSED                                                        [ 71%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[ibis] FAILED                                                                 [ 85%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[duckdb] FAILED                                                               [100%]

========================================================== short test summary info ==========================================================
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[sqlframe] - AttributeError: 'SparkLikeNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[ibis] - AttributeError: 'IbisNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[duckdb] - AttributeError: 'DuckDBNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
======================================================== 3 failed, 4 passed in 0.53s ========================================================

Should we also implement the missing ones?

msalvany avatar Oct 31 '25 21:10 msalvany

Hey @msalvany - thanks for the contribution 🚀

As a little side note, to expand a bit on Dan's comment: we try to mirror the polars API, so we will aim to have narwhals.struct, as mentioned in the original issue, behaving the same as polars.struct for all backends.

In a similar way, narwhals.concat_list will mirror polars.concat_list.

However:

I thought that in the same manner, concat_tuple would work

concat_tuple is not a polars function, therefore we won't have it either. There are a few exceptions to this rule, but this is not one of them.


Regarding other backends:

I have started with the tests. I see that there are more backends than pandas, polars and arrow.

For now you can start by xfailing them in the tests. I can see you are already xfailing certain polars versions, so you can do something along the following lines:

def test_dryrun(constructor: Constructor, *, request: pytest.FixtureRequest) -> None:
    if "polars" in str(constructor) and POLARS_VERSION < (1, 0, 0):
        # nth only available after 1.0
        request.applymarker(pytest.mark.xfail)

+    if any(x in str(constructor) for x in ("dask", "duckdb", "ibis", "pyspark", "sqlframe")):
+        reason = "Not supported/not implemented"
+        request.applymarker(pytest.mark.xfail(reason=reason))

and in those backend namespaces you can add struct = not_implemented() instead of defining the method.

I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼

FBruzzesi avatar Oct 31 '25 22:10 FBruzzesi

Hi,

Thanks for the clarification @FBruzzesi, I totally get it now! I have changed all concat_struct references to struct.

msalvany avatar Nov 02 '25 12:11 msalvany

Hey @msalvany first and foremost, thanks for updating the PR - it looks close to the finish line 🙏🏼

I have a few comments, especially regarding tests:

  • In the test you are running the function, but it would be good to also compare the result with an expected output. Something along the lines of:
     result = ...
     expected = ...  # <- a dictionary mapping each column name to its list of values, matching the result dataframe
     assert_data_equal(result, expected)
    
  • Locally make sure to run pytest narwhals --doctest-modules as well. I think there is some formatting misalignment in the docstring example
  • I just noticed that in the contributing guide the part on pre-commit is not very clear. I would suggest running:
    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files
    
  • I will update the PR title and convert it to draft - you are always free to change it back whenever you think it's ready
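
The expected-output comparison from the first bullet could look roughly like the sketch below. It is plain Python with no narwhals imports: `struct_column` and `assert_columns_equal` are hypothetical stand-ins (for the struct operation and for the test suite's comparison helper), so treat it as an illustration of the shape of `expected`, not as the actual test.

```python
from typing import Any


def struct_column(data: dict[str, list[Any]], names: list[str]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the struct operation: build one dict per row."""
    columns = [data[name] for name in names]
    return [dict(zip(names, row)) for row in zip(*columns)]


def assert_columns_equal(
    result: dict[str, list[Any]], expected: dict[str, list[Any]]
) -> None:
    """Hypothetical comparison helper: same column names, same values per column."""
    assert list(result) == list(expected), "column names differ"
    for name, values in expected.items():
        assert result[name] == values, f"column {name!r} differs"


data = {"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [True, False, True]}

# The result has a single struct column "t"; each row is a mapping of field -> value.
result = {"t": struct_column(data, ["a", "b", "c"])}
expected = {
    "t": [
        {"a": 1, "b": "x", "c": True},
        {"a": 2, "b": "y", "c": False},
        {"a": 3, "b": "z", "c": True},
    ]
}
assert_columns_equal(result, expected)
```

The point is that `expected` is keyed by output column name, and for a struct column each value in the list is itself a field-to-value mapping.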

FBruzzesi avatar Nov 03 '25 00:11 FBruzzesi

thanks all! just a comment on

I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼

we should at least verify that this operation is feasible for spark/duckdb. fortunately, in this case, it looks like it's easily done with struct_pack, e.g.

In [35]: rel = duckdb.sql("select * from values (1,4,0),(1,5,1),(2,6,2) df(a,b,i)")

In [36]: rel
Out[36]:
┌───────┬───────┬───────┐
│   a   │   b   │   i   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     1 │     4 │     0 │
│     1 │     5 │     1 │
│     2 │     6 │     2 │
└───────┴───────┴───────┘

In [37]: rel.select('a', 'b', 'i', duckdb.FunctionExpression('struct_pack', 'a', 'b'))
Out[37]:
┌───────┬───────┬───────┬──────────────────────────────┐
│   a   │   b   │   i   │      struct_pack(a, b)       │
│ int32 │ int32 │ int32 │ struct(a integer, b integer) │
├───────┼───────┼───────┼──────────────────────────────┤
│     1 │     4 │     0 │ {'a': 1, 'b': 4}             │
│     1 │     5 │     1 │ {'a': 1, 'b': 5}             │
│     2 │     6 │     2 │ {'a': 2, 'b': 6}             │
└───────┴───────┴───────┴──────────────────────────────┘

in pyspark it looks like it's just struct

MarcoGorelli avatar Nov 03 '25 10:11 MarcoGorelli

In [35]: rel = duckdb.sql("select * from values (1,4,0),(1,5,1),(2,6,2) df(a,b,i)")

In [36]: rel
Out[36]:
┌───────┬───────┬───────┐
│   a   │   b   │   i   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     1 │     4 │     0 │
│     1 │     5 │     1 │
│     2 │     6 │     2 │
└───────┴───────┴───────┘

In [37]: rel.select('a', 'b', 'i', duckdb.FunctionExpression('struct_pack', 'a', 'b'))
Out[37]:
┌───────┬───────┬───────┬──────────────────────────────┐
│   a   │   b   │   i   │      struct_pack(a, b)       │
│ int32 │ int32 │ int32 │ struct(a integer, b integer) │
├───────┼───────┼───────┼──────────────────────────────┤
│     1 │     4 │     0 │ {'a': 1, 'b': 4}             │
│     1 │     5 │     1 │ {'a': 1, 'b': 5}             │
│     2 │     6 │     2 │ {'a': 2, 'b': 6}             │
└───────┴───────┴───────┴──────────────────────────────┘

Hello @MarcoGorelli, I'm going to use your example here to ask if the output we expect after nw.struct() is a new column containing the struct inside the original dataframe (as you showed here), or rather a new independent df with a single column containing the struct.

If I understand this right, what polars.struct() generates is the 2nd option, but I might be mistaken.

So far, this is what I have been mimicking; just let me know if it should be changed. Thanks!

msalvany avatar Nov 03 '25 10:11 msalvany

a new column containing the struct inside the original dataframe (as you showed here), or rather a new independent df with a single column containing the struct.

this depends on whether you use with_columns or select
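
To make the distinction concrete, here is a small plain-Python model of the two behaviours. The `struct`, `select` and `with_columns` functions below are hypothetical toy versions operating on a dict of lists, not the narwhals or polars API; they only illustrate which columns end up in the output.

```python
from typing import Any

Frame = dict[str, list[Any]]


def struct(frame: Frame, *names: str) -> list[dict[str, Any]]:
    """Toy struct: pack the named columns into one dict per row."""
    return [dict(zip(names, row)) for row in zip(*(frame[n] for n in names))]


def select(frame: Frame, **new_columns: list[Any]) -> Frame:
    """Toy select: the output contains only the requested columns."""
    return dict(new_columns)


def with_columns(frame: Frame, **new_columns: list[Any]) -> Frame:
    """Toy with_columns: the output keeps the original columns and adds the new ones."""
    return {**frame, **new_columns}


frame = {"a": [1, 1, 2], "b": [4, 5, 6], "i": [0, 1, 2]}
packed = struct(frame, "a", "b")

selected = select(frame, s=packed)
extended = with_columns(frame, s=packed)

print(list(selected))  # ['s']
print(list(extended))  # ['a', 'b', 'i', 's']
```

In both cases the struct expression itself is the same; only the surrounding operation decides whether the original columns are kept.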

MarcoGorelli avatar Nov 03 '25 10:11 MarcoGorelli

in pyspark it looks like it's just struct

I simply tested the struct from pyspark to be sure we get the same, and it looks fine too:

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

data = [(1, 4, 0), (1, 5, 1), (2, 6, 2)]
columns = ["a", "b", "i"]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, columns)
df_with_struct = df.select("a", "b", "i", struct("a", "b").alias("struct_col"))
df_with_struct.show(truncate=False)
+---+---+---+----------+
|a  |b  |i  |struct_col|
+---+---+---+----------+
|1  |4  |0  |{1, 4}    |
|1  |5  |1  |{1, 5}    |
|2  |6  |2  |{2, 6}    |
+---+---+---+----------+

msalvany avatar Nov 04 '25 11:11 msalvany