DataFramesMeta.jl icon indicating copy to clipboard operation
DataFramesMeta.jl copied to clipboard

silent error when using `,` in operations

Open vjd opened this issue 4 years ago • 11 comments

consider the use of @orderby below. Look at the last example. Somehow, the syntax allows the use of , does not error, but provides an incorrect result.

julia> dd = DataFrame(a = [3, 1, 2], b = [5,4, 6])
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     3      5
   2 │     1      4
   3 │     2      6

# works correctly
julia> @orderby(dd, :a, :b)
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      6
   3 │     3      5

# works correctly
julia> @orderby dd :a :b
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      6
   3 │     3      5

# works silenty, but incorrect
julia> @orderby dd :a, :b
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     3      5
   2 │     1      4
   3 │     2      6

vjd avatar Jan 09 '22 08:01 vjd

Weird, thanks for the issue report. I would have expected the last one to sort the tuples lexicographically, but it does not seem to do that.

pdeffebach avatar Jan 10 '22 01:01 pdeffebach

oh wait, it is sorting the tuple lexicographically. But within this it's sorting the arrays lexicographically. This is "right" in the sense that it's behaving as Julia behaves, but it is unintuitive behavior.

MacroTools.@macroexpand(@orderby dd :a, :b) |> MacroTools.prettify
:((DataFramesMeta).orderby(dd, DataFramesMeta.make_source_concrete([:b, :a]) => (((bat, falcon)->(falcon, bat)) => Symbol("##257"))))

with

function orderby(x::SubDataFrame, @nospecialize(args...))
    t = DataFrames.select(x, args...)
    x[sortperm(t), :]
end

so we see

julia> select(dd, DataFramesMeta.make_source_concrete([:b, :a]) => (((bat, falcon)->(falcon, bat)) => Symbol("##257")))
3×1 DataFrame
 Row │ ##257
     │ Tuple…
─────┼────────────────────────
   1 │ ([3, 1, 2], [5, 4, 6])
   2 │ ([3, 1, 2], [5, 4, 6])
   3 │ ([3, 1, 2], [5, 4, 6])

Sorry, i'll finish writing this up tomorrow

pdeffebach avatar Jan 10 '22 01:01 pdeffebach

Okay I'm back.

Here's whats going on. The transformation

([:b, :a]) => ((bat, falcon)->(falcon, bat))

creates a tuple of vectors. Because a tuple of vectors is functionally a scalar in DataFrames.jl's transformation language (i.e. not a table or a vector), it gets "spread" across rows, the same way

julia> df = DataFrame(a = [1, 2]);

julia> transform(df, :a => (t -> 1) => :b)
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │     2      1

works.

So it's creating the data frame

3×1 DataFrame
 Row │ ##257
     │ Tuple…
─────┼────────────────────────
   1 │ ([3, 1, 2], [5, 4, 6])
   2 │ ([3, 1, 2], [5, 4, 6])
   3 │ ([3, 1, 2], [5, 4, 6])

and then sorting by the ##257 column. Which means the order doesn't change (or is undefined ?) after the sort happens.

I agree that it's very confusing that @orderby df :a, :b looks so similar to @orderby(df, :a, :b).

One potential solution is to disallow expressions that look like Expr(:tuple, ...) to avoid this problem... but I'm worried that could have unforseen side effects.

Now that you understand what DataFramesMeta.jl is doing behind the scenes, do you have more thoughts on how to approach this issue?

pdeffebach avatar Jan 11 '22 22:01 pdeffebach

So in essence:

@orderby df :a, :b

is the same as

@orderby df tuple(:a, :b)

due to how Julia parser works.

Right?

bkamins avatar Jan 11 '22 22:01 bkamins

Yes, exactly.

pdeffebach avatar Jan 11 '22 23:01 pdeffebach

I'm not sure what the correct solution here would be. Maybe we document the use of tuple in such cases, which is essentially equivalent to writing the syntax with the parenthesis. So maybe just Advocate the use of one common syntax rather than two different ways of writing? Given a choice, to be consistent with dataframes.jl, it is probably easier to write the macros with parentheses and start deprecating the one without. It will be consistent also with the syntax that is used in R where most of the users are coming from

vjd avatar Jan 11 '22 23:01 vjd

it is probably easier to write the macros with parentheses and start deprecating the one without

This is what I do (i.e. I always use a function call style). But we cannot "deprecate it" - this is an intentional feature of Julia not of DataFramesMeta.jl. Probably we should highlight to users that there are two styles allowed by Julia and explained them better.

bkamins avatar Jan 12 '22 07:01 bkamins

So maybe just Advocate the use of one common syntax rather than two different ways of writing?

Yes, this doesn't fully solve the problem, since @mymacro a, b is always going to be parsed as a tuple by default. Even if the recommended way is @mymacro(a, b).

it is probably easier to write the macros with parentheses and start deprecating the one without.

The reason we support the non-parentheses version is because of @passmissing and other macro flags. Ironically, for the same problems that we are writing out here.

@rorderby(df, @passmissing :y, :b)

will get parsed as

@rorderby(df, @passmissing(tuple(:y, :b))

the solution to this problem was to allow multiple lines so the macro-flags could get properly parsed

@rorderby df begin 
    @passmissing :y 
    :b 
end

It will be consistent also with the syntax that is used in R where most of the users are coming from

Yes but @rtransform df :a or inside a chain block is closer to Stata, which is also a market-share to appeal to. Plus, I dislike the requirement of parentheses in dplyr, because I think they discourage complicated transformations. I was actually hoping to get rid of all the parentheses in documentation to get people to use the begin ... end version.

I think the way forward is to disallow @rtransform :a, :b so users dont get confused.

pdeffebach avatar Jan 12 '22 16:01 pdeffebach

I think the way forward is to disallow @rtransform :a, :b so users dont get confused.

I'm not sure that helps, as you can of course disallow a tuple, but you would also need to disallow macros that work on tuples. This is because @rtransform @byrow something, other_thing would resolve to one macro expression as an argument, and I'm not sure that you don't get into trouble if you also disallow tuples within a macro as well.

One solution would be to disallow multiple arguments except listed in a begin end block, but that feels overly restrictive.

The premise of this whole issue is that there is a silent "error", but of course it's not that, it's just confusing syntax. But given that Julia's macro parsing rules are fixed and you can't discriminate between a call with or without parentheses inside the macro, I think this is just not avoidable.

jkrumbiegel avatar Jan 13 '22 21:01 jkrumbiegel

But given that Julia's macro parsing rules are fixed and you can't discriminate between a call with or without parentheses inside the macro

This was my point above. I think it is best to very clearly explain the options and their consequences in the manual. @pdeffebach - if you prefer I can submit a PR.

bkamins avatar Jan 14 '22 06:01 bkamins

Maybe we need a "Gotcha's" section of the docs? Could be a good opportunity to beef up the docs before 1.0.

pdeffebach avatar Mar 01 '24 15:03 pdeffebach