polars
polars copied to clipboard
Implement `.get_column()` for LazyFrames (i.e. Lazy Series)
Problem Description
Sometimes there are situations in which a Series is needed rather than a DataFrame, such as when using .is_in()
to compare columns in two different DataFrames.
In Lazy mode, this currently requires me to .collect()
the desired column into a DataFrame and then use .get_column()
.
It would be nice to have lazy access to individual columns as well, i.e. to have something like LazySeries
.
I'm not sure if this answers your question:
when using .is_in() to compare columns in two different DataFrames.
I don't know what use case you're describing here, but a generic way to compare columns across multiple dataframes is by performing a join and then using pl.col()
to select the columns from the left and right tables.
It would be nice to have lazy access to individual columns as well, i.e. to have something like LazySeries.
pl.col()
is a lazy expression representing a columnar series with homogenous type?
You are correct in that my use case could be covered with a join (or anti-join).
I still think that it would be useful to have this functionality, since it already does exist in eager mode, and it is extremely useful to be able to switch effortlessly between eager and lazy modes.
Surely there are lots of use cases for lazy series that can be used as standalone objects by using s = df.get_column('A')
. pl.col()
doesn't work for this unless I'm missing something.
You must add a context if you want to access columns from another LazyFrame
.
There might be valid use cases, but for filtering this definitely is an antipattern. semi/anitjoins are highly performant and optimized for this purpose.
I've tried to create an example on how to do the .is_in()
lookup on two separate lazyframes. Does this help you @DrMaphuse ?
df = pl.DataFrame(
{
"a": [0, 1, 2, 3, 4],
"b": [5, 6, 7, 8, 9],
"c": [10, 11, 12, 13, 14],
}
).lazy()
lookup = pl.DataFrame({"lookup": [1, 4]}).lazy()
(
df
.with_context(lookup)
.with_column(
pl.when(pl.col('a').is_in(pl.col("lookup"))).then("yeah").otherwise("boo").alias('is_in')
)
).collect()
If not, can you maybe give an example?
@YuRiTan
Your example is close to something I'm trying to do. However, I want to return another column's value in the .then
clause from the lookup
table (let's call this column 'price'). In this way, I want to use the lookup
column to match on records from the a
column, then return the corresponding price.
I've tried this by putting pl.col('price')
in the .then
clause, but I receive this error:
ShapeError: shapes of
maskand
otherare not suitable for
zip_with operation
Any help appreciated. Thanks.
Hi @E-WebbV ,
The error means that the data frame column and lookup column do not have the same length. This is necessary when you want to fill the vale from the pl.col('price')
in the .then()
clause.
with equal lengths, it works:
df = pl.DataFrame(
{
"a": [0, 1, 2, 3, 4],
"b": [5, 6, 7, 8, 9],
"c": [10, 11, 12, 13, 14],
}
).lazy()
lookup = pl.DataFrame({"lookup": [1, 4, 5, 6, 7]}).lazy()
(
df
.with_context(lookup)
.with_columns(
pl.when(pl.col('a').is_in(pl.col("lookup"))).then(pl.col("lookup")).otherwise("boo").alias('is_in')
)
).collect()
shape: (5, 4)
┌─────┬─────┬─────┬───────┐
│ a ┆ b ┆ c ┆ is_in │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 0 ┆ 5 ┆ 10 ┆ boo │
│ 1 ┆ 6 ┆ 11 ┆ 4 │
│ 2 ┆ 7 ┆ 12 ┆ boo │
│ 3 ┆ 8 ┆ 13 ┆ boo │
│ 4 ┆ 9 ┆ 14 ┆ 7 │
└─────┴─────┴─────┴───────┘
But looking at what you're trying to do, it sounds more like a join
to me? Joins are also a lot faster compared to lookups in dataframes from other contexts. If not, could you please provide a minimal working example to show what you mean?
Closing this as Expr
is equivalent to a lazy Series and LazyFrame.select
exists.