append! fails when a column references another
Hi
Please see the example below. The way append! is implemented this fails.
I searched for existing issues, as I expected this has come up in the past (and maybe the behavior is intentional?
using DataFrames
df=DataFrame(a=1:10)
#works
df.b = deepcopy(df.a)
append!(df,deepcopy(df))
#works
df.c = 4*df.a #it would think it is common to assing new columns like this
append!(df,deepcopy(df))
#fails
df.c = df.a #I guess this line does NOT make a copy of df.a whereas 4*df.a does?
append!(df,deepcopy(df))
the fact that df.c = df.a does not make a copy is documented.
The question is, what would you expect/prefer:
- throwing error (because you are trying to
append!to a data frame with aliased columns) - performing un-aliasing and making
append!work
@nalimilan - what do you think?
The question is, what would you expect/prefer:
- throwing error (because you are trying to
append!to a data frame with aliased columns)- performing un-aliasing and making
append!work
I have no preference.
@nalimilan - so we currently do the former (throw an error). Do you think we should change this and de-alias in append!/push! and related? (in fact it would make the code simpler, as currently we have a quite complex recovery code in case of alias)
I would find it a bit weird for append! to dealias columns. For example if you did c = df.c before calling append! then it would no longer be the same object. This may can unexpected as it's completely unrelated to the goal of the append! operation.
If we wanted to fix this, it would sound more logical to have df.c = df.a make a copy when aliasing is detected. But that would be slightly breaking.
it would sound more logical to have
df.c = df.amake a copy when aliasing is detected
This is a point to consider ideed. I wonder if anyone could be affected by such behavior. This, in general, would mean that we probably should do de-aliasing in general when changing the columns of a data frame. The question is if we see scenarios in which someone might actively want to store aliased columns in a data frame?
I imagine there can be valid use cases for this, like if you need to store columns which could have different values but happen to be equal under different names. But these sound really marginal, and potentially dangerous. We could ask on Slack.
the fact that
df.c = df.adoes not make a copy is documented.
This settles it for me. A simple rule is better than a complicated one.
OK. I asked on Slack. Now a comment on the design of this if we wanted de-aliasing. We would keep an IdDict of columns in a DataFrame structure. This would allow us to quickly identify if column is aliased and copy it if needed.
This settles it for me. A simple rule is better than a complicated one.
This was an initial idea - to have a simple rule. But we want with @nalimilan to doublecheck what users prefer. Also - I will probably make an update to the docs warning about aliasing even stronger than we do now if we keep current behavior.
Comment from Slack by: maybe we could somehow indicate aliased columns when showing data frame.
I think it's good to stay simple/consistent. that df.a = df.b aliases can be a bit surprising, but once you know about it it's fine. I think it would be more surprising if I construct my columns in a way such that I expect them to share memory, but somehow during processing they de-alias. If this option to create a copy and de-alias is implemented, I think it should at least emit a warning to the user.
That said, the usage of append! and push! in the OP could be made to work unambiguously (without de-aliasing) if the new items to be added into the aliased columns are themselves ===
maybe we could somehow indicate aliased columns when showing data frame.
this does generally seems nice, independently of whatever is decided here
I'm not sure indicating aliases when printing will help a lot since with large data frames, most columns are not visible. Could we just improve append! so that in case of errors it checks for the presence of aliases and reports that?
Could we just improve
append!so that in case of errors it checks for the presence of aliases and reports that?
This is a good idea. I will implement this
Also for push! and related functions.
See also https://bkamins.github.io/julialang/2023/09/15/copying.html