Concat horizontal strict: ignore empty lazyframes
Description
Concatenating an empty lazyframe with any other lazyframe errors in strict mode:
import polars as pl
import polars.selectors as cs
a = pl.LazyFrame({"a": [1, 2], "b": [3, 4]})
# ~cs.all() selects no columns, so the second frame is empty
b = pl.concat([a.select(cs.all()), a.select(~cs.all())], how="horizontal", strict=True)
b.collect()
This errors with ShapeError: cannot concat dataframes with different heights in 'strict' mode.
In general I think strict should only matter for the columns/lazyframes that are actually being concatenated.
This rule is currently followed for non-empty lazyframes, since the following works fine:
a = pl.LazyFrame({"a": [1, 2], "b": [3, 4]})
b = pl.LazyFrame({"c": [1, 2, 3]})
c = pl.concat([a, b], how="horizontal", strict=True)
d = c.select("a").collect()
(Column c, which comes from b, has a different height, but it is not checked in 'strict' mode because it never ends up being used.) Empty lazyframes, however, do not seem to get the same treatment.
If this issue is approved, I can work on the implementation.
I think I agree here, pl.concat with an empty frame should basically be a no-op. This allows for the fairly common pattern:
df = pl.DataFrame()
for _ in range(n):  # n: number of frames to fetch
    df_tmp = get_some_frame()
    df = pl.concat([df, df_tmp])
Of course, it's probably better to collect the frames into a list instead, and concatenate all at once, but having this option is good.
@mcrumiller should I consider this issue accepted? @orlp if you agree I can get started on this
Nope sorry I am just an avid contributor not a maintainer.
We have an exception for the zero-height zero-width DataFrame which may actually be concatenated with any other height DataFrame, but other than that exception the zero-width DataFrame still has a height.
Is that right?
import polars as pl
a = pl.LazyFrame({'a': [0] * 99, 'b': [1]* 99})
print(a.select([]).collect().height)
b = pl.concat([a.select('a'), a.select([])], how='horizontal', strict=True).collect()
The height of a.select([]) is 0 (I would have expected it to be 99).
And despite the fact that a.select([]) has 0 width and height, it cannot be concatenated with a.select('a'): ShapeError: cannot concat dataframes with different heights in 'strict' mode
I don't know if we want the fix to be:
A) a.select([]) should have the same height as a
B) allow strict horizontal concat between a lazyframe, and another lazyframe which has 0 width/height
Both seem false at the moment
I am mostly leaning towards option A. In general, zero-column dataframes have no height in polars, which can be surprising.
import pandas as pd
import polars as pl
df = pd.DataFrame(index=[1, 2])  # two rows, no columns
print(len(df))
pl_df = pl.from_pandas(df)
print(len(pl_df))
For example this prints 2 for pandas and 0 for polars.
I would have expected to be able to concatenate this with another dataframe of height 2, but I can't.
Just to point out that df.select(some_expression) with one column doesn't even necessarily have the same height as df. For instance a.select(pl.col("a").unique()).collect().height is 1 whereas a.collect().height is 99.
Also, if df.select().height = 99 then what is df.select().filter(pl.coalesce(pl.col("^.$"), pl.lit(None))==0).height?
Changing the height seems like it might (or dare I say, would probably) produce bugs elsewhere whereas making concat ignore any df with shape (0,0) seems like a pretty self contained change.
That said, I think there's an argument that ignoring shape (0,0) isn't strict. If you want to ignore those frames while staying strict for non-zero heights, you could implement that yourself, maybe with something like:
def semi_strict_concat(
    items: list[pl.DataFrame | pl.LazyFrame], **kwargs
) -> pl.DataFrame | pl.LazyFrame:
    non_empty_items = []
    df_indices = []  # positions of eager DataFrames in non_empty_items
    any_lf = False
    for item in items:
        if isinstance(item, pl.LazyFrame):
            # a LazyFrame counts as empty when its schema has no columns
            if len(item.collect_schema()) != 0:
                any_lf = True
                non_empty_items.append(item)
        elif isinstance(item, pl.DataFrame):
            if item.shape != (0, 0):
                df_indices.append(len(non_empty_items))
                non_empty_items.append(item)
    # if inputs are a mix of lf and df then make them all lf
    if any_lf and len(df_indices) > 0:
        for df_ind in df_indices:
            non_empty_items[df_ind] = non_empty_items[df_ind].lazy()
    return pl.concat(non_empty_items, **kwargs)
While my first reaction is that it shouldn't be considered strict to ignore empty dfs, I think in practice I'd actually be indifferent.
Disclaimer: I'm also not a maintainer so these are just my personal thoughts not polars's thoughts.
I would implement it myself, but I don't want to pay the cost of calling collect_schema just to handle this edge case.