polars fix(rust): fix melt panic when only one column is present

fix #10075

When looking at this issue, I thought of a couple ways we could handle this:

Return an empty DataFrame:

This could make sense if you interpret the melt operation as requiring at least two columns: one to serve as the identifier variable(s) and one to serve as the value variable(s). If there's only one column, there are no value variables to melt, so the result is an empty DataFrame.

Return the single column and its data:

This could make sense if you interpret the melt operation as collapsing all non-identifier columns into a single column. If there's only one column, it's already in the desired format, so the result is the original DataFrame.

Return a custom error:

This could make sense if you want to enforce that the melt operation should only be used on DataFrames with at least two columns.
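
To make the three options concrete, here is a rough sketch on a toy "frame" (a plain dict of column name to list). It is not the polars implementation; the function name `melt_no_value_columns` and its `policy` argument are made up purely for illustration.

```python
# Toy model: a "frame" is a dict mapping column name -> list of values.
# This is NOT polars code, only a sketch of the three candidate behaviours.

def melt_no_value_columns(frame, id_vars, policy):
    """Handle a melt in which every column is an id column (hypothetical helper)."""
    value_vars = [name for name in frame if name not in id_vars]
    if value_vars:
        raise ValueError("this sketch only covers the case with no value columns")

    if policy == "empty":
        # Option 1: keep the melted schema, but return zero rows.
        return {**{name: [] for name in id_vars}, "variable": [], "value": []}
    if policy == "passthrough":
        # Option 2: the frame is already "long", so return it unchanged.
        return {name: list(frame[name]) for name in id_vars}
    if policy == "error":
        # Option 3: refuse to melt when there is nothing to melt.
        raise ValueError("melt requires at least one value column")
    raise ValueError(f"unknown policy: {policy!r}")


frame = {"single_column": [1, 2, 3]}
empty = melt_no_value_columns(frame, ["single_column"], "empty")
passthrough = melt_no_value_columns(frame, ["single_column"], "passthrough")
print(empty)
print(passthrough)
```

Note that the options differ in schema as well as row count: the first keeps the `variable` and `value` columns, while the second returns only the id column.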

I chose the second option, since the idea of melt is to go from wide to long format, and a single column is already considered long.

output:

df = pl.DataFrame({'single_column': [1,2,3],})
result = df.melt('single_column')
print(result)

shape: (3, 1)
┌───────────────┐
│ single_column │
│ ---           │
│ i64           │
╞═══════════════╡
│ 1             │
│ 2             │
│ 3             │
└───────────────┘

Oct 20 '23 16:10 romanovacca

I think it would be good to add this behaviour to the documentation so users are aware of it. I will add the documentation if you agree with my current implementation.

Oct 20 '23 21:10 romanovacca

Not sure about the failed CI; locally there are no errors. Do I need to rebase or something?

Oct 22 '23 17:10 romanovacca

I think returning an empty dataframe would make much more sense here. The length of the returned dataframe should be the number of ids times the number of value columns. If the number of value columns is 0, then the returned dataframe should be empty. See this example:

import polars as pl

df = pl.DataFrame({
    'ids': ['id1', 'id2'],
    'a': [0, 1],
    'b': [10, 11]
})

print(df.select('ids', 'a', 'b').melt(id_vars = 'ids'))
# shape: (4, 3)
# ┌─────┬──────────┬───────┐
# │ ids ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ id1 ┆ a        ┆ 0     │
# │ id2 ┆ a        ┆ 1     │
# │ id1 ┆ b        ┆ 10    │
# │ id2 ┆ b        ┆ 11    │
# └─────┴──────────┴───────┘

print(df.select('ids', 'a').melt(id_vars = 'ids'))
# shape: (2, 3)
# ┌─────┬──────────┬───────┐
# │ ids ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ id1 ┆ a        ┆ 0     │
# │ id2 ┆ a        ┆ 1     │
# └─────┴──────────┴───────┘

print(df.select('ids').melt(id_vars = 'ids'))
# shape: (0, 3)                  <<< I expect this result
# ┌─────┬──────────┬───────┐
# │ ids ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# └─────┴──────────┴───────┘

Note: for what it's worth, this is how pandas handles it.
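
The length rule can also be checked without any dataframe library. Below is a minimal sketch of a melt over a dict of lists; `toy_melt` and `select` are hypothetical helpers written only for this illustration and say nothing about how polars or pandas implement melt internally.

```python
def toy_melt(frame, id_vars):
    """Melt a dict-of-lists frame: repeat the id columns once per value column."""
    value_vars = [name for name in frame if name not in id_vars]
    out = {name: [] for name in id_vars}
    out["variable"] = []
    out["value"] = []
    for var in value_vars:
        n_rows = len(frame[var])
        for name in id_vars:
            out[name].extend(frame[name])
        out["variable"].extend([var] * n_rows)
        out["value"].extend(frame[var])
    return out


def select(frame, *names):
    """Keep only the named columns (toy stand-in for df.select)."""
    return {name: frame[name] for name in names}


df = {"ids": ["id1", "id2"], "a": [0, 1], "b": [10, 11]}

for cols in (("ids", "a", "b"), ("ids", "a"), ("ids",)):
    result = toy_melt(select(df, *cols), ["ids"])
    # Row count is (number of ids) * (number of value columns).
    print(cols, len(result["value"]))
# ('ids', 'a', 'b') 4
# ('ids', 'a') 2
# ('ids',) 0
```

With no value columns the loop body never runs, so the empty result with the full `ids`, `variable`, `value` schema falls out of the general rule rather than needing a special case.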

Oct 23 '23 08:10 gab23r

Not sure what the behavior here should be. But I rebased this to fix the failing CI (it was unrelated).

Oct 25 '23 07:10 stinodego

@stinodego what do we want to do with this? Should I ask the community on Discord for more opinions? Or do we want to implement it as is, and change it later if we decide another implementation is preferred?

Dec 22 '23 10:12 romanovacca

I'll close this PR as the work here isn't really transferable to the intended solution. I'll submit a fix myself.

Jan 28 '24 23:01 stinodego
