Implement `.get_column()` for LazyFrames (i.e. Lazy Series)

Open DrMaphuse opened this issue 1 year ago • 3 comments

Problem Description

Sometimes there are situations in which a Series is needed rather than a DataFrame, such as when using .is_in() to compare columns in two different DataFrames.

In Lazy mode, this currently requires me to .collect() the desired column into a DataFrame and then use .get_column().

It would be nice to have lazy access to individual columns as well, i.e. to have something like LazySeries.

DrMaphuse avatar Sep 01 '22 15:09 DrMaphuse

I'm not sure if this answers your question:

when using .is_in() to compare columns in two different DataFrames.

I don't know what use case you're describing here, but a generic way to compare columns across multiple dataframes is by performing a join and then using pl.col() to select the columns from the left and right tables.

It would be nice to have lazy access to individual columns as well, i.e. to have something like LazySeries.

Isn't pl.col() already a lazy expression representing a column, i.e. a series with a homogeneous type?

OneRaynyDay avatar Sep 01 '22 15:09 OneRaynyDay

You are correct in that my use case could be covered with a join (or anti-join).

I still think that it would be useful to have this functionality, since it already does exist in eager mode, and it is extremely useful to be able to switch effortlessly between eager and lazy modes.

Surely there are lots of use cases for lazy series that can be used as standalone objects by using s = df.get_column('A'). pl.col() doesn't work for this unless I'm missing something.

DrMaphuse avatar Sep 01 '22 16:09 DrMaphuse

You must add a context if you want to access columns from another LazyFrame.

There might be valid use cases, but for filtering this is definitely an antipattern. Semi and anti joins are highly performant and optimized for this purpose.

ritchie46 avatar Sep 02 '22 07:09 ritchie46

I've tried to create an example of how to do the .is_in() lookup on two separate LazyFrames. Does this help you, @DrMaphuse?

import polars as pl

df = pl.DataFrame(
    {
        "a": [0, 1, 2, 3, 4],
        "b": [5, 6, 7, 8, 9],
        "c": [10, 11, 12, 13, 14],
    }
).lazy()
lookup = pl.DataFrame({"lookup": [1, 4]}).lazy()

(
    df
    .with_context(lookup)
    .with_columns(
        # pl.lit() marks the strings as literal values rather than column names
        pl.when(pl.col('a').is_in(pl.col("lookup"))).then(pl.lit("yeah")).otherwise(pl.lit("boo")).alias('is_in')
    )
).collect()

If not, can you maybe give an example?

YuRiTan avatar Oct 18 '22 14:10 YuRiTan