DataAPI.jl
Column selectors should guarantee column order is preserved
I think that all column selectors (other than arrays) should guarantee that the column order in the original table is preserved. One would certainly expect that to be the case for Between, though it's not explicitly mentioned in the docstring. It would be a bummer if you had foo(x, y) = 2x .+ y but Between(:x, :y) => foo happened to lower to [:y, :x] => foo instead of [:x, :y] => foo.
And I think it makes sense to guarantee column order preservation for the other selectors. E.g.
df = DataFrame(a=1, b=2, c=3)
select(df, Not(:b) => foo)
should be guaranteed to lower to
select(df, [:a, :c] => foo)
rather than
select(df, [:c, :a] => foo)
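One way to picture the guarantee, as a rough sketch in plain Julia: if a selector is lowered by filtering the table's own column list, the result can only come out in table order. The helpers `lower_not` and `lower_regex` below are hypothetical names made up for illustration; they are not part of DataAPI.jl or DataFrames.jl.

```julia
# Column names of the example table, in table order.
table_columns = [:a, :b, :c]

# Hypothetical helpers (illustration only, not DataAPI.jl API).
# Both filter the table's column list, so the output keeps table order.
lower_not(columns, excluded) = filter(c -> c ∉ excluded, columns)
lower_regex(columns, pattern) = filter(c -> occursin(pattern, String(c)), columns)

lower_not(table_columns, [:b])       # [:a, :c]
lower_regex(table_columns, r"[bc]")  # [:b, :c]
```

In DataFrames.jl, something like names(df, Not(:b)) should let you inspect what a selector resolves to on a real table (it returns the names as strings).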
I'm not totally certain of the best way to specify the column ordering properties of Cols, but I think this specification makes sense:
- Individual column selectors inside Cols are first lowered to (ordered) arrays.
- The lowering of the individual column selectors (except for arrays) follows the rule above that table column order should be preserved.
- Cols is then lowered as follows: Cols(A, B, C) ==> [A, B\A, C\(A ∪ B)] (where the arguments on the right side are splatted into the array).
Since setdiff on arrays preserves the order of the first argument to setdiff, we get the following behavior:
df = DataFrame(a=1, b=2, c=3)
Cols([:c, :b], [:a, :b]) == [:c, :b, :a]
Cols(r"[bc]", r"[ab]") == [:b, :c, :a]
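The Cols rule above can be sketched in plain Julia, assuming each argument has already been lowered to an ordered vector of column names. `lower_cols` is a hypothetical helper written for this illustration, not the actual DataAPI.jl or DataFrames.jl implementation.

```julia
# Hypothetical helper (illustration only): combine already-lowered,
# ordered selections following Cols(A, B, C) ==> [A, B\A, C\(A ∪ B)].
function lower_cols(selections::Vector{Symbol}...)
    result = Symbol[]
    for selection in selections
        # setdiff keeps the order of its first argument, so the columns
        # not yet in `result` are appended in the order `selection` lists them.
        append!(result, setdiff(selection, result))
    end
    return result
end

# Same cases as above; the regexes are written out as the ordered
# vectors they would lower to on a table with columns a, b, c.
lower_cols([:c, :b], [:a, :b])  # [:c, :b, :a]
lower_cols([:b, :c], [:a, :b])  # [:b, :c, :a]
```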
What you propose is exactly how it is implemented in DataFrames.jl (unless I made a bug in the code).
Essentially, the rule can be stated as follows: column selectors are evaluated left to right, and if a duplicate is encountered it is ignored.
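As a small sketch of that phrasing in plain Julia (not the DataFrames.jl code itself): concatenating the lowered selections left to right and keeping only the first occurrence of each name gives the same result as the setdiff formulation. The variable names here are made up for the example.

```julia
# Already-lowered selections, in the order they were passed to Cols.
selections = ([:c, :b], [:a, :b])

# vcat concatenates left to right; unique keeps the first occurrence
# of each column name and drops later duplicates.
lowered = unique(vcat(selections...))  # [:c, :b, :a]
```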
> What you propose is exactly how it is implemented in DataFrames.jl
Agreed, that is how it is currently implemented in DataFrames.jl. I just thought it might be a good idea to make the order guarantee explicit in DataAPI.jl.
Going back to this issue - would you care to make a PR implementing the proposed changes? Thank you!