polars
polars copied to clipboard
`update` DataFrame rows based on other DataFrame
Problem description
afaik, there currently is no way to update rows in a DataFrame based on another DataFrame My current workaround looks like this:
import polars as pl
df1 = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pl.DataFrame({"a": [2, 3], "b": [8, 9]})
(
df1
.join(df2, on="a", how="left")
.with_column(
pl.when(pl.col("b_right").is_not_null())
.then(pl.col("b_right"))
.otherwise(pl.col("b"))
.alias("b")
)
.drop("b_right")
)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 8 │
│ 3 ┆ 9 │
└─────┴─────┘
There might be another way for this, but I don't think there is a good way to do this?! (there is even a bug in the above code if df2
already has a column b_right
or if col b
contains nulls)
It would be nice to have something like update
(available in pandas) which would look like this:
df1.update(other=df2, on="a")
This could also be extended with join
, columns
and other arguments to make it more flexible.
Your 'workaround' is similar to what we would do internally if we'd implemented it.
A join is the right way to go. You should be able to switch-out the when.then.otherwise
for the more recently-introduced coalesce
, for a more compact representation:
.with_column(
pl.coalesce([pl.col("b_right"),pl.col("b")]).alias("b")
)
@ritchie46 @alexander-beedie thank you for your feedback!
coalesce
is interesting, did not know that exists.
I guess it is often hard to determine if adding certain features adds real value in terms of usability or is more of a gimmick / unnecessary noise.
I personally would like this feature, because working with data is not always about performance, but also about creating "clean code" and making it easy to understand what is going on. One aspect of this imo is creating methods where the INTENTION is immediately clear!
examples
- INTENTION immediately clear:
-
.min
,.max
,first
, ... -
.groupby
+.agg
,.pivot
, ... -
.sort
,join
, ...
-
- hard to unterstand
-
.join(...).with_column(when.then.otherwise.alias).drop
- oh, you are telling me this is just updating a column? why is there no method for that like:
-
.update
-
.col().arr.lengths() != .col().arr.unique().arr.lenghts()
- oh, you are telling me this is just checking for duplicates? why is there no method for that like:
-
.has_duplicates
(#6137)
-
Updating one dataframe using another is important. We request the Pandas .update function functionality wherein we update one dataframe using values from another.
I am using the above code with coalesce function. However, I don't know how to loop through the set of columns and update them. Manually inserting column names like above is not a solution when multiple people are using your code for creating custom tables.
Hi! I had asked the question on updating one dataframe using another (like Pandas update) on Stack Overflow. Someone has answered it well. Here is the link: https://stackoverflow.com/questions/75324971/python-polars-update-one-dataframe-with-another-like-pandas-update-functio/75328301#75328301
Perhaps .set_at_idx()
partially solves the issue since it accepts lists for indices and values. Also how is updating a column in a data frame in 10 lines of code in 2023 still a thing - what am I missing?
@jmakov, the aim is to update one dataframe using values from another, not just a column. You are familiar with pandas .update() method, which does exactly this. In my case, I need this functionality to avoid suppressing table rows / columns due to absence of database records pertaining to specific table row / column. You have the code now (link above). If you could improve it (less code, or faster code or both), that would be great.
This sort of works:
def _update(self, df_other, join_columns):
# The columns that will be updated
columns = [c for c in df_other.columns if c not in join_columns]
return (
self.join(df_other, how='left', on=join_columns, suffix='_NEW')
.with_columns(**{
c: pl.coalesce([pl.col(c+'_NEW'), pl.col(c)]) for c in columns # <-This updates the columns
})
.select(pl.all().exclude('^.*_NEW$')) # <- this drops the temporary '*_NEW' columns
)
you could then either:
pl.DataFrame.update = _update # <- do this to hackyly add it to as a method to the main polars namespace (not recommended)
# or
df.pipe(update, df_other, ...)
Just a tidy up of the previous comment if anyone wants it:
def update(self:pl.DataFrame, df_other:pl.DataFrame, join_columns:list[str])->pl.DataFrame:
'''Updates dataframe with the values in df_other after joining on the join_columns'''
# The columns that will be updated
columns = [c for c in df_other.columns if c not in join_columns]
df_ans = (self
.join(df_other, how='left', on=join_columns, suffix='_NEW')
.with_columns(**{
c: pl.coalesce([pl.col(c+'_NEW'), pl.col(c)]) for c in columns # <-This updates the columns
}).select(
pl.all().exclude('^.*_NEW$') # <- this drops the temporary '*_NEW' columns
)
)
return df_ans