DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

append! fails when a column references another

Open kafisatz opened this issue 2 years ago • 13 comments

Hi

Please see the example below. The way append! is implemented this fails.

I searched for existing issues, as I expected this has come up in the past (and maybe the behavior is intentional?

using DataFrames
df=DataFrame(a=1:10)

#works
    df.b = deepcopy(df.a)
    append!(df,deepcopy(df))

#works
    df.c = 4*df.a #it would think it is common to assing new columns like this
    append!(df,deepcopy(df))

#fails
    df.c = df.a #I guess this line does NOT make a copy of df.a whereas 4*df.a does?
    append!(df,deepcopy(df))


kafisatz avatar Sep 04 '23 15:09 kafisatz

the fact that df.c = df.a does not make a copy is documented.

The question is, what would you expect/prefer:

  • throwing error (because you are trying to append! to a data frame with aliased columns)
  • performing un-aliasing and making append! work

@nalimilan - what do you think?

bkamins avatar Sep 04 '23 16:09 bkamins

The question is, what would you expect/prefer:

  • throwing error (because you are trying to append! to a data frame with aliased columns)
  • performing un-aliasing and making append! work

I have no preference.

kafisatz avatar Sep 04 '23 16:09 kafisatz

@nalimilan - so we currently do the former (throw an error). Do you think we should change this and de-alias in append!/push! and related? (in fact it would make the code simpler, as currently we have a quite complex recovery code in case of alias)

bkamins avatar Sep 04 '23 18:09 bkamins

I would find it a bit weird for append! to dealias columns. For example if you did c = df.c before calling append! then it would no longer be the same object. This may can unexpected as it's completely unrelated to the goal of the append! operation.

If we wanted to fix this, it would sound more logical to have df.c = df.a make a copy when aliasing is detected. But that would be slightly breaking.

nalimilan avatar Sep 08 '23 07:09 nalimilan

it would sound more logical to have df.c = df.a make a copy when aliasing is detected

This is a point to consider ideed. I wonder if anyone could be affected by such behavior. This, in general, would mean that we probably should do de-aliasing in general when changing the columns of a data frame. The question is if we see scenarios in which someone might actively want to store aliased columns in a data frame?

bkamins avatar Sep 09 '23 14:09 bkamins

I imagine there can be valid use cases for this, like if you need to store columns which could have different values but happen to be equal under different names. But these sound really marginal, and potentially dangerous. We could ask on Slack.

nalimilan avatar Sep 09 '23 15:09 nalimilan

the fact that df.c = df.a does not make a copy is documented.

This settles it for me. A simple rule is better than a complicated one.

jariji avatar Sep 09 '23 17:09 jariji

OK. I asked on Slack. Now a comment on the design of this if we wanted de-aliasing. We would keep an IdDict of columns in a DataFrame structure. This would allow us to quickly identify if column is aliased and copy it if needed.

This settles it for me. A simple rule is better than a complicated one.

This was an initial idea - to have a simple rule. But we want with @nalimilan to doublecheck what users prefer. Also - I will probably make an update to the docs warning about aliasing even stronger than we do now if we keep current behavior.

bkamins avatar Sep 09 '23 17:09 bkamins

Comment from Slack by: maybe we could somehow indicate aliased columns when showing data frame.

bkamins avatar Sep 09 '23 20:09 bkamins

I think it's good to stay simple/consistent. that df.a = df.b aliases can be a bit surprising, but once you know about it it's fine. I think it would be more surprising if I construct my columns in a way such that I expect them to share memory, but somehow during processing they de-alias. If this option to create a copy and de-alias is implemented, I think it should at least emit a warning to the user.

That said, the usage of append! and push! in the OP could be made to work unambiguously (without de-aliasing) if the new items to be added into the aliased columns are themselves ===

maybe we could somehow indicate aliased columns when showing data frame.

this does generally seems nice, independently of whatever is decided here

adienes avatar Sep 11 '23 01:09 adienes

I'm not sure indicating aliases when printing will help a lot since with large data frames, most columns are not visible. Could we just improve append! so that in case of errors it checks for the presence of aliases and reports that?

nalimilan avatar Sep 11 '23 07:09 nalimilan

Could we just improve append! so that in case of errors it checks for the presence of aliases and reports that?

This is a good idea. I will implement this

Also for push! and related functions.

bkamins avatar Sep 11 '23 09:09 bkamins

See also https://bkamins.github.io/julialang/2023/09/15/copying.html

bkamins avatar Sep 15 '23 08:09 bkamins