Concat horizontal strict: ignore empty lazyframes

Open cBournhonesque opened this issue 1 week ago • 1 comments

Description

Concatenating an empty lazyframe with any other lazyframe errors in strict mode:

import polars as pl
import polars.selectors as cs

a = pl.LazyFrame({"a": [1, 2], "b": [3, 4]})
# a.select(~cs.all()) selects no columns, so it is a zero-width lazyframe
b = pl.concat([a.select(cs.all()), a.select(~cs.all())], how="horizontal", strict=True)
b.collect()

This errors with ShapeError: cannot concat dataframes with different heights in 'strict' mode.

In general I think strict should only matter for the columns/lazyframes that are actually being concatenated. This rule is already followed for non-empty lazyframes, since the following works fine:

a = pl.LazyFrame({"a": [1, 2], "b": [3, 4]})
b = pl.LazyFrame({"c": [1, 2, 3]})
c = pl.concat([a, b], how="horizontal", strict=True)
d = c.select("a").collect()

(b does not have the same height as a, but it is not considered for 'strict' since none of its columns end up being used.) However, this doesn't seem to extend to empty lazyframes.

If this issue is approved, I can work on the implementation.

cBournhonesque avatar Dec 10 '25 15:12 cBournhonesque

I think I agree here, pl.concat with an empty frame should basically be a no-op. This allows for the fairly common pattern:

df = pl.DataFrame()

for _ in _:
    df_tmp = get_some_frame()
    # pl.concat takes a list of frames, so pass the accumulator and the new frame together
    df = pl.concat([df, df_tmp])

Of course, it's probably better to collect the frames into a list instead, and concatenate all at once, but having this option is good.

mcrumiller avatar Dec 10 '25 15:12 mcrumiller

@mcrumiller should I consider this issue accepted? @orlp if you agree I can get started on this

cBournhonesque avatar Dec 17 '25 19:12 cBournhonesque

Nope, sorry, I am just an avid contributor, not a maintainer.

mcrumiller avatar Dec 17 '25 21:12 mcrumiller

We have an exception for the zero-height, zero-width DataFrame, which may be concatenated with a DataFrame of any height. Other than that exception, a zero-width DataFrame still has a height.

orlp avatar Dec 17 '25 21:12 orlp

Is that right?

import polars as pl
a = pl.LazyFrame({'a': [0] * 99, 'b': [1] * 99})
print(a.select([]).collect().height)
b = pl.concat([a.select('a'), a.select([])], how='horizontal', strict=True).collect()

The height of a.select([]) is 0 (I would have expected it to be 99). And despite the fact that a.select([]) has 0 width and height, it cannot be concatenated with a.select('a'): ShapeError: cannot concat dataframes with different heights in 'strict' mode

I don't know if we want the fix to be:
A) a.select([]) should have the same height as a
B) allow strict horizontal concat between a lazyframe and another lazyframe which has 0 width/height

Both seem false at the moment

cBournhonesque avatar Dec 18 '25 17:12 cBournhonesque

I am mostly leaning towards option A. In general empty dataframes don't have height in polars, which can be surprising.

import pandas as pd
import polars as pl

# A pandas frame with an index but no columns
df = pd.DataFrame(index=[1, 2])
print(len(df))
pl_df = pl.from_pandas(df)
print(len(pl_df))

For example, this prints 2 for pandas and 0 for polars. I would have expected to be allowed to concatenate this to another dataframe of height 2, but I can't.

cBournhonesque avatar Dec 18 '25 19:12 cBournhonesque

Just to point out that df.select(some_expression) with one column doesn't even necessarily have the same height as df. For instance a.select(pl.col("a").unique()).collect().height is 1 whereas a.collect().height is 99.

Also, if df.select().height = 99 then what is df.select().filter(pl.coalesce(pl.col("^.$"), pl.lit(None))==0).height?

Changing the height seems like it might (or dare I say, would probably) produce bugs elsewhere whereas making concat ignore any df with shape (0,0) seems like a pretty self contained change.

That said, I think there's an argument that ignoring shape (0,0) isn't strict and if you want to ignore those while being strict for non-zero heights then you should implement that yourself, maybe something like

import polars as pl

def semi_strict_concat(
    items: list[pl.DataFrame | pl.LazyFrame], **kwargs
) -> pl.DataFrame | pl.LazyFrame:
    non_empty_items = []
    df_indices = []  # positions of eager DataFrames within non_empty_items
    any_lf = False
    for item in items:
        if isinstance(item, pl.LazyFrame):
            # Resolve the schema to check whether the LazyFrame has any columns
            if len(item.collect_schema()) != 0:
                any_lf = True
                non_empty_items.append(item)
        elif isinstance(item, pl.DataFrame):
            if item.shape != (0, 0):
                df_indices.append(len(non_empty_items))
                non_empty_items.append(item)
    # If the inputs are a mix of lf and df, make them all lf
    if any_lf and df_indices:
        for df_ind in df_indices:
            non_empty_items[df_ind] = non_empty_items[df_ind].lazy()
    return pl.concat(non_empty_items, **kwargs)

While my first reaction is that it shouldn't be considered strict to ignore empty dfs, I think in practice I'd actually be indifferent.

Disclaimer: I'm also not a maintainer so these are just my personal thoughts not polars's thoughts.

deanm0000 avatar Dec 19 '25 18:12 deanm0000

I would implement it myself, but I don't want to pay the cost of calling collect_schema just to handle this edge case.

cBournhonesque avatar Dec 19 '25 20:12 cBournhonesque