DataFrames.jl

Autogenerated suffixes can clash

Open · knuesel opened this issue 3 years ago · 1 comment

I was bitten by this issue in a case that looked like this:

julia> df = DataFrame(x=1:3);

julia> select(df, :x => (x->2x), :x => (x->3x))
ERROR: ArgumentError: duplicate output column name: :x_function
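A workaround, sketched here for readers rather than taken from the report itself: both anonymous functions receive the same auto-generated name, so giving each transformation an explicit target name avoids the clash. The names `:x2` and `:x3` are illustrative only.

```julia
using DataFrames

df = DataFrame(x=1:3)

# Each pair is source => function => target. The explicit targets
# (:x2 and :x3, illustrative names) replace the shared default
# name :x_function that both anonymous functions would get.
res = select(df, :x => (x -> 2x) => :x2, :x => (x -> 3x) => :x3)
```

The parentheses around each anonymous function are needed so that the trailing `=> :x2` is parsed as the target name and not as part of the function body.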

I guess this works as documented:

the generated name is created by concatenating source column name and function name by default (see examples below).

But it would be nice if name auto-generation were smarter, so that the generated names are guaranteed to be unique.

Another case is arguably worse, because it silently overwrites existing data:

julia> f(x) = 2x;

julia> df = DataFrame(x=1:3, x_f=0)
3×2 DataFrame
 Row │ x      x_f   
     │ Int64  Int64 
─────┼──────────────
   1 │     1      0
   2 │     2      0
   3 │     3      0

julia> transform!(df, :x => f)
3×2 DataFrame
 Row │ x      x_f   
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
   2 │     2      4
   3 │     3      6
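One way to guard against this overwrite is sketched below. It assumes the default name has the form `<source>_<function name>`, as the quoted documentation describes; `:x_doubled` and `default_name` are illustrative names, not part of DataFrames.jl.

```julia
using DataFrames

f(x) = 2x

df = DataFrame(x=1:3, x_f=0)

# Option 1: pass an explicit target name so the existing x_f column
# is left untouched (:x_doubled is an illustrative name).
transform!(df, :x => f => :x_doubled)

# Option 2: check whether the default name is already taken before
# relying on it. Assumption: the default is "<source>_<function name>".
default_name = string("x_", nameof(f))
clashes = default_name in names(df)
if clashes
    @warn "column $default_name already exists and would be overwritten"
end
```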

Finally, a case that is a bit contrived:

julia> g(x) = 3x;

julia> f_g(x) = 1;

julia> df = DataFrame(x=1:3, x_f=0);

julia> transform(df, :x => f_g, :x_f => g)
ERROR: ArgumentError: duplicate output column name: :x_f_g

knuesel avatar Mar 03 '21 15:03 knuesel

All of this works as expected. Still, we may reconsider adding a makeunique kwarg to these functions, which is why I am keeping this issue open.

The general idea is that, in production code, it is safer to throw an error than to silently modify a column name, whether it was generated automatically or passed explicitly by the user.
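For context, here is a small sketch of how the existing `makeunique` kwarg behaves in `hcat`, where it is already available. It only illustrates the current behavior in that function; it says nothing about how a future kwarg on `select` or `transform` would work.

```julia
using DataFrames

a = DataFrame(x=1:3)
b = DataFrame(x=4:6)

# By default, duplicate column names raise an error:
# hcat(a, b)  # throws ArgumentError

# With makeunique=true, the duplicate name is deduplicated
# by appending a numeric suffix.
c = hcat(a, b, makeunique=true)
names(c)
```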

bkamins avatar Mar 03 '21 15:03 bkamins