polars icon indicating copy to clipboard operation
polars copied to clipboard

Implement `.get_column()` for LazyFrames (i.e. Lazy Series)

Open DrMaphuse opened this issue 2 years ago • 3 comments

Problem Description

Sometimes there are situations in which a Series is needed rather than a DataFrame, such as when using .is_in() to compare columns in two different DataFrames.

In Lazy mode, this currently requires me to .collect() the desired column into a DataFrame and then use .get_column().

It would be nice to have lazy access to individual columns as well, i.e. to have something like LazySeries.

DrMaphuse avatar Sep 01 '22 15:09 DrMaphuse

I'm not sure if this answers your question:

when using .is_in() to compare columns in two different DataFrames.

I don't know what use case you're describing here, but a generic way to compare columns across multiple dataframes is by performing a join and then using pl.col() to select the columns from the left and right tables.

It would be nice to have lazy access to individual columns as well, i.e. to have something like LazySeries.

pl.col() is a lazy expression representing a columnar series with homogenous type?

OneRaynyDay avatar Sep 01 '22 15:09 OneRaynyDay

You are correct in that my use case could be covered with a join (or anti-join).

I still think that it would be useful to have this functionality, since it already does exist in eager mode, and it is extremely useful to be able to switch effortlessly between eager and lazy modes.

Surely there are lots of use cases for lazy series that can be used as standalone objects by using s = df.get_column('A'). pl.col() doesn't work for this unless I'm missing something.

DrMaphuse avatar Sep 01 '22 16:09 DrMaphuse

You must add a context if you want to access columns from another LazyFrame.

There might be valid use cases, but for filtering this definitely is an antipattern. semi/anitjoins are highly performant and optimized for this purpose.

ritchie46 avatar Sep 02 '22 07:09 ritchie46

I've tried to create an example on how to do the .is_in() lookup on two separate lazyframes. Does this help you @DrMaphuse ?

df = pl.DataFrame(
    {
        "a": [0, 1, 2, 3, 4],
        "b": [5, 6, 7, 8, 9],
        "c": [10, 11, 12, 13, 14],
    }
).lazy()
lookup = pl.DataFrame({"lookup": [1, 4]}).lazy()

(
    df
    .with_context(lookup)
    .with_column(
        pl.when(pl.col('a').is_in(pl.col("lookup"))).then("yeah").otherwise("boo").alias('is_in')
    )
).collect()

If not, can you maybe give an example?

YuRiTan avatar Oct 18 '22 14:10 YuRiTan

@YuRiTan Your example is close to something I'm trying to do. However, I want to return another column's value in the .then clause from the lookup table (let's call this column 'price'). In this way, I want to use the lookup column to match on records from the a column, then return the corresponding price.

I've tried this by putting pl.col('price') in the .then clause, but I receive this error: ShapeError: shapes of maskandotherare not suitable forzip_with operation

Any help appreciated. Thanks.

E-WebbV avatar Jun 02 '23 12:06 E-WebbV

Hi @E-WebbV ,

The error means that the data frame column and lookup column do not have the same length. This is necessary when you want to fill the vale from the pl.col('price') in the .then() clause.

with equal lengths, it works:

df = pl.DataFrame(
    {
        "a": [0, 1, 2, 3, 4],
        "b": [5, 6, 7, 8, 9],
        "c": [10, 11, 12, 13, 14],
    }
).lazy()
lookup = pl.DataFrame({"lookup": [1, 4, 5, 6, 7]}).lazy()

(
    df
    .with_context(lookup)
    .with_columns(
        pl.when(pl.col('a').is_in(pl.col("lookup"))).then(pl.col("lookup")).otherwise("boo").alias('is_in')
    )
).collect()
shape: (5, 4)
┌─────┬─────┬─────┬───────┐
│ a   ┆ b   ┆ c   ┆ is_in │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 0   ┆ 5   ┆ 10  ┆ boo   │
│ 1   ┆ 6   ┆ 11  ┆ 4     │
│ 2   ┆ 7   ┆ 12  ┆ boo   │
│ 3   ┆ 8   ┆ 13  ┆ boo   │
│ 4   ┆ 9   ┆ 14  ┆ 7     │
└─────┴─────┴─────┴───────┘

But looking at what you're trying to do, it sounds more like a join to me? Joins are also a lot faster compared to lookups in dataframes from other contexts. If not, could you please provide a minimal working example to show what you mean?

YuRiTan avatar Jun 16 '23 07:06 YuRiTan

Closing this as Expr is equivalent to a lazy Series and LazyFrame.select exists.

stinodego avatar Jan 30 '24 13:01 stinodego