polars
fix(rust): fix melt panic when only one column is present
fix #10075
When looking at this issue, I thought of a couple ways we could handle this:
- Return an empty DataFrame:
This could make sense if you interpret the melt operation as requiring at least two columns: one to serve as the identifier variable(s) and one to serve as the value variable(s). If there's only one column, there are no value variables to melt, so the result is an empty DataFrame.
- Return the single column and its data:
This could make sense if you interpret the melt operation as collapsing all non-identifier columns into a single column. If there's only one column, it's already in the desired format, so the result is the original DataFrame.
- Return a custom error:
This could make sense if you want to enforce that the melt operation should only be used on DataFrames with at least two columns.
I chose the second option, since the idea of melt is to go from wide to long format, and a DataFrame with a single column is already considered long.
output:
import polars as pl
df = pl.DataFrame({'single_column': [1, 2, 3]})
result = df.melt('single_column')
print(result)
shape: (3, 1)
┌───────────────┐
│ single_column │
│ ---           │
│ i64           │
╞═══════════════╡
│ 1             │
│ 2             │
│ 3             │
└───────────────┘
I think it would be good to add this behaviour to the documentation so users are aware of it. I will add the documentation if you agree with my current implementation.
Not sure about the failed CI; locally there are no errors. Do I need to rebase or something?
I think returning an empty dataframe would make much more sense here. The length of the returned dataframe should be the number of ids times the number of value columns. If the number of value columns is 0, then the returned dataframe should be empty. See this example:
import polars as pl
df = pl.DataFrame({
'ids': ['id1', 'id2'],
'a': [0, 1],
'b': [10, 11]
})
print(df.select('ids', 'a', 'b').melt(id_vars = 'ids'))
# shape: (4, 3)
# ┌─────┬──────────┬───────┐
# │ ids ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ id1 ┆ a        ┆ 0     │
# │ id2 ┆ a        ┆ 1     │
# │ id1 ┆ b        ┆ 10    │
# │ id2 ┆ b        ┆ 11    │
# └─────┴──────────┴───────┘
print(df.select('ids', 'a').melt(id_vars = 'ids'))
# shape: (2, 3)
# ┌─────┬──────────┬───────┐
# │ ids ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ id1 ┆ a        ┆ 0     │
# │ id2 ┆ a        ┆ 1     │
# └─────┴──────────┴───────┘
print(df.select('ids').melt(id_vars = 'ids'))
# shape: (0, 3) <<< I expect this result
# ┌─────┬──────────┬───────┐
# │ ids ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# └─────┴──────────┴───────┘
Note: for what it's worth, this is the way pandas is managing this.
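To make the row-count argument concrete, here is a minimal pure-Python sketch of a wide-to-long melt. It is not the polars or pandas implementation; `melt_rows` is a hypothetical helper written only for illustration, and it assumes every column has the same length. The point is that the output has (number of rows) x (number of value columns) entries, so zero value columns gives zero rows.

```python
def melt_rows(columns, id_vars):
    """Illustrative melt over a dict of equal-length lists (hypothetical helper)."""
    # Every column that is not an identifier is treated as a value column.
    value_vars = [name for name in columns if name not in id_vars]
    n_rows = len(next(iter(columns.values()))) if columns else 0

    out = []
    for var in value_vars:          # one block of rows per value column
        for i in range(n_rows):     # one output row per input row
            row = {name: columns[name][i] for name in id_vars}
            row["variable"] = var
            row["value"] = columns[var][i]
            out.append(row)
    return out


wide = {"ids": ["id1", "id2"], "a": [0, 1], "b": [10, 11]}
print(len(melt_rows(wide, ["ids"])))                                  # 2 rows * 2 value columns
print(len(melt_rows({"ids": ["id1", "id2"], "a": [0, 1]}, ["ids"])))  # 2 rows * 1 value column
print(len(melt_rows({"ids": ["id1", "id2"]}, ["ids"])))               # 2 rows * 0 value columns
```

With only the `ids` column there is nothing to loop over, so the result is empty, which matches the last case in the example above.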
Not sure what the behavior here should be. But I rebased this to fix the failing CI (it was unrelated).
@stinodego what do we want to do with this? Should I ask the community on Discord for more opinions? Or do we want to implement it as is, and change it later if we decide another implementation is preferred?
I'll close this PR as the work here isn't really transferable to the intended solution. I'll submit a fix myself.