ColumnNotFoundError on concat of LazyFrames; works fine if collected first

Open acowlikeobject opened this issue 10 months ago • 8 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Sadly, the construction of these dataframes is long and complex, so I'll have to figure out how to create a minimal reproducible example. In the meantime, I'm hoping someone has an intuition about the problem.

If there are any other attributes on Polars objects that would narrow this down, please let me know.

assert type(df1) == type(df2) == pl.LazyFrame
assert df1.schema == df2.schema

pl.concat([df1.collect(), df2.collect()])  # Works fine if collected first.

pl.concat([df1, df2]).collect()  # Fails with output below.

Log output

avg line length: 57.054688
std. dev. line length: 8.579314
initial row estimate: 2116
no. of chunks: 8 processed by: 8 threads.
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
dataframe filtered
dataframe filtered
keys/aggregates are not partitionable: running default HASH AGGREGATION
avg line length: 117.78516
std. dev. line length: 14.399422
initial row estimate: 10842
no. of chunks: 8 processed by: 8 threads.
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
dataframe filtered
dataframe filtered

UNION: `parallel=false` union is run sequentially
CACHE SET: cache id: 0

---------------------------------------------------------------------------
File ~/miniconda3/lib/python3.11/site-packages/polars/lazyframe/frame.py:1943, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1940 if background:
   1941     return InProcessQuery(ldf.collect_concurrently())
-> 1943 return wrap_df(ldf.collect())

ColumnNotFoundError: Level

Issue description

I have two LazyFrames created from different functions, but with identical schemas.

concat works fine if I collect them first. However, concat on the LazyFrames fails.

The column Level in the error was present during both LazyFrames' construction, but was dropped from both before the concat.

Expected behavior

pl.concat([df1, df2]).collect() produces the same result as pl.concat([df1.collect(), df2.collect()]).

Installed versions

--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Python:               3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.5.6
numpy:                1.26.1
openpyxl:             3.1.2
pandas:               2.1.2
pyarrow:              11.0.0
pydantic:             2.3.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           <not installed>

acowlikeobject • Apr 04 '24 04:04

Does it work on polars==0.20.16? I'm encountering a similar-but-possibly-different issue (still investigating/trying to make an MRE) and mine works on 0.20.16 but not later versions...

aut0clave • Apr 04 '24 23:04

@aut0clave

You could try to narrow down which optimization is causing it, which may help, e.g. .collect(projection_pushdown=False)

There was some significant work done after 0.20.16 which caused a few issues:

  • https://github.com/pola-rs/polars/issues/15442#issuecomment-2032419969

cmdlineluser • Apr 05 '24 00:04

Possibly related to #12917?

owenprough-sift • Apr 05 '24 13:04

> .collect(projection_pushdown=False)

@aut0clave @cmdlineluser Your intuition was correct. collect after the concat works fine in 0.20.16 (throws an error in 0.20.18).

In 0.20.18, .collect(projection_pushdown=False) works fine.
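For anyone hitting something similar, the one-flag-at-a-time narrowing suggested above can be sketched roughly as follows. This is only an illustration, not part of Polars: `find_culprit_flags` and `fake_collect` are hypothetical names, and the flag list is copied from the `LazyFrame.collect` signature in the 0.20.18 traceback, so it may not match other releases.

```python
from typing import Any, Callable, Iterable, List

# Optimization keyword arguments taken from the LazyFrame.collect signature
# shown in the traceback (Polars 0.20.18). Other releases may differ.
OPTIMIZATION_FLAGS = (
    "type_coercion",
    "predicate_pushdown",
    "projection_pushdown",
    "simplify_expression",
    "slice_pushdown",
    "comm_subplan_elim",
    "comm_subexpr_elim",
)


def find_culprit_flags(
    collect: Callable[..., Any],
    flags: Iterable[str] = OPTIMIZATION_FLAGS,
) -> List[str]:
    """Return the flags whose individual disabling makes `collect` succeed.

    `collect` is any callable accepting the flags as boolean keyword
    arguments, e.g. the bound method `pl.concat([df1, df2]).collect`.
    """
    try:
        collect()  # Baseline: all optimizations at their defaults.
    except Exception:
        pass
    else:
        return []  # Nothing fails, so there is nothing to narrow down.

    culprits = []
    for flag in flags:
        try:
            collect(**{flag: False})  # Disable exactly one optimization.
        except Exception:
            continue  # Still failing: this flag alone is not responsible.
        culprits.append(flag)
    return culprits


def fake_collect(**flags: bool) -> str:
    """Hypothetical stand-in that fails unless projection pushdown is off."""
    if flags.get("projection_pushdown", True):
        raise RuntimeError("ColumnNotFoundError: Level")
    return "ok"


print(find_culprit_flags(fake_collect))
```

With the real frames, the call would be `find_culprit_flags(pl.concat([df1, df2]).collect)`. Based on the results reported here, the returned list would be expected to include "projection_pushdown" on 0.20.18, though other flags might show up as well.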

acowlikeobject • Apr 06 '24 01:04

@acowlikeobject Did you use struct.field or struct[xxx]?

reswqa • Apr 07 '24 07:04

> @acowlikeobject Did you use struct.field or struct[xxx]?

@reswqa I didn't use structs at all in the construction of these dataframes.

acowlikeobject • Apr 07 '24 09:04

This is not a p-high issue if OP hasn't come up with a reproducible example. We cannot do anything atm. @acowlikeobject can you try to come up with an example? We need to be able to reproduce it to be able to fix it.

ritchie46 • Apr 21 '24 09:04

@ritchie46 Sadly, I'm not likely to have time in the next few days. I'm happy to close this out, and maybe it'll get resolved when one of the related issues tagged in this thread is fixed.

acowlikeobject • Apr 21 '24 14:04