polars icon indicating copy to clipboard operation
polars copied to clipboard

`update` DataFrame rows based on other DataFrame

Open Julian-J-S opened this issue 2 years ago • 3 comments

Problem description

afaik, there currently is no way to update rows in a DataFrame based on another DataFrame My current workaround looks like this:

import polars as pl
df1 = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pl.DataFrame({"a": [2, 3], "b": [8, 9]})
(
    df1
    .join(df2, on="a", how="left")
    .with_column(
        pl.when(pl.col("b_right").is_not_null())
        .then(pl.col("b_right"))
        .otherwise(pl.col("b"))
        .alias("b")
    )
    .drop("b_right")
)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
└─────┴─────┘

There might be another way for this, but I don't think there is a good way to do this?! (there is even a bug in the above code if df2 already has a column b_right or if col b contains nulls)

It would be nice to have something like update (available in pandas) which would look like this:

df1.update(other=df2, on="a")

This could also be extended with join, columns and other arguments to make it more flexible.

Julian-J-S avatar Jan 13 '23 19:01 Julian-J-S

Your 'workaround' is similar to what we would do internally if we'd implemented it.

ritchie46 avatar Jan 13 '23 19:01 ritchie46

A join is the right way to go. You should be able to switch-out the when.then.otherwise for the more recently-introduced coalesce, for a more compact representation:

  .with_column(
      pl.coalesce([pl.col("b_right"),pl.col("b")]).alias("b")
  )

alexander-beedie avatar Jan 13 '23 21:01 alexander-beedie

@ritchie46 @alexander-beedie thank you for your feedback! coalesce is interesting, did not know that exists.

I guess it is often hard to determine if adding certain features adds real value in terms of usability or is more of a gimmick / unnecessary noise.

I personally would like this feature, because working with data is not always about performance, but also about creating "clean code" and making it easy to understand what is going on. One aspect of this imo is creating methods where the INTENTION is immediately clear!

examples

  • INTENTION immediately clear:
    • .min, .max, first, ...
    • .groupby + .agg, .pivot, ...
    • .sort, join, ...
  • hard to unterstand
    • .join(...).with_column(when.then.otherwise.alias).drop
      • oh, you are telling me this is just updating a column? why is there no method for that like:
      • .update
    • .col().arr.lengths() != .col().arr.unique().arr.lenghts()
      • oh, you are telling me this is just checking for duplicates? why is there no method for that like:
      • .has_duplicates (#6137)

Julian-J-S avatar Jan 15 '23 14:01 Julian-J-S

Updating one dataframe using another is important. We request the Pandas .update function functionality wherein we update one dataframe using values from another.

I am using the above code with coalesce function. However, I don't know how to loop through the set of columns and update them. Manually inserting column names like above is not a solution when multiple people are using your code for creating custom tables.

sbnimbalkar avatar Feb 02 '23 14:02 sbnimbalkar

Hi! I had asked the question on updating one dataframe using another (like Pandas update) on Stack Overflow. Someone has answered it well. Here is the link: https://stackoverflow.com/questions/75324971/python-polars-update-one-dataframe-with-another-like-pandas-update-functio/75328301#75328301

sbnimbalkar avatar Feb 03 '23 08:02 sbnimbalkar

Perhaps .set_at_idx() partially solves the issue since it accepts lists for indices and values. Also how is updating a column in a data frame in 10 lines of code in 2023 still a thing - what am I missing?

jmakov avatar Feb 03 '23 21:02 jmakov

@jmakov, the aim is to update one dataframe using values from another, not just a column. You are familiar with pandas .update() method, which does exactly this. In my case, I need this functionality to avoid suppressing table rows / columns due to absence of database records pertaining to specific table row / column. You have the code now (link above). If you could improve it (less code, or faster code or both), that would be great.

sbnimbalkar avatar Feb 04 '23 03:02 sbnimbalkar

This sort of works:

def _update(self, df_other,  join_columns):

    # The columns that will be updated
    columns = [c for c in df_other.columns if c not in join_columns]
    return (
        self.join(df_other, how='left', on=join_columns, suffix='_NEW')      
        .with_columns(**{
            c: pl.coalesce([pl.col(c+'_NEW'), pl.col(c)]) for c in columns # <-This updates the columns
        })
        .select(pl.all().exclude('^.*_NEW$')) # <- this drops the temporary '*_NEW' columns
    )
    
  

you could then either:

pl.DataFrame.update = _update # <- do this to hackyly add it to as a method to the main polars namespace (not recommended)

# or

df.pipe(update, df_other, ...)

mkleinbort-ic avatar Feb 06 '23 15:02 mkleinbort-ic

Just a tidy up of the previous comment if anyone wants it:


def update(self:pl.DataFrame, df_other:pl.DataFrame,  join_columns:list[str])->pl.DataFrame:
    '''Updates dataframe with the values in df_other after joining on the join_columns'''
    
    # The columns that will be updated
    columns = [c for c in df_other.columns if c not in join_columns]
    
    df_ans = (self
        .join(df_other, how='left', on=join_columns, suffix='_NEW')      
        .with_columns(**{
            c: pl.coalesce([pl.col(c+'_NEW'), pl.col(c)]) for c in columns # <-This updates the columns
        }).select(
            pl.all().exclude('^.*_NEW$') # <- this drops the temporary '*_NEW' columns
           )
       )    
    return df_ans

mkleinbort-ic avatar Feb 06 '23 15:02 mkleinbort-ic