DataFrames.jl
DataFrames.jl copied to clipboard
Add a validator for src => transform => dst
In order to improve readability of error messages
@pdeffebach - can you give me please an example when we produce something that is not clear? I have checked and for incorrect selectors we produce:
julia> select(df, 'a')
ERROR: ArgumentError: Unrecognized column selector: a
I think the best suggestion for this was a function to help construct pairs.
make_pair(source = [:a, :b], fun = f, dest = AsTable)`
transform(df, make_pair(...))
There should also be functionality added to make typeof(source)
, typeof(f(source))
, and typeof(dest)
work.
This might also help with #2870 if the function syntax would make it easier to programatically create a list of args...
to pass to transform
and the like.
Could you please explain what is the advantage of:
make_pair(source = [:a, :b], fun = f, dest = AsTable)
over
[:a, :b] => f => AsTable
?
For me it seems more verbose and less readable.
There should also be functionality added to make
typeof(source)
,typeof(f(source))
, andtypeof(dest)
work.
I am not fully clear what you mean here? Could you please provide an example of assumed input and expected output?
Thank you!
It is only advantageous if the new function (or package) can provide additional error checking, type checking, and validation. Ideally, it could tell you what part of your pair is invalid or needs broadcasted.
Ideally, it could tell you what part of your pair is invalid or needs broadcasted.
OK - now I understand.
The problem is that it is extremely hard (if not impossible). The problem is that Julia is very flexible what you can define to work (e.g. consider https://docs.julialang.org/en/v1/manual/methods/#Function-like-objects-1 which are objects, but are callable - you cannot detect such things by checking types unfortunately)
But I will think what can be done.
I'll call on @pdeffebach and @CameronBieganek to add their thoughts too.
I'm afraid I don't know enough to be helpful with the implementation details. Could you start by supporting more basic types and just print a warning "Validation not yet supported" for functors and such? I imagine advanced users would construct the pairs themselves anyway.
I think another problem is that because the rules are so complicated, having a validator be separate from the place where the error is would lead to code duplication and, at worst, divergent behavior.
One thing we could also do is export a prettier version of normalize_selection
, which is where a lot of the pre-processing happens. This would, in effect, be a validator. I will check if this is feasible.
I will check if this is feasible.
We would just need to allow passing a data frame, or grouped data frame instead of an index. This is fully feasible.
Plus we need to warn users that normalize_selection
works on a single transform (because .=>
stuff happens before).
This also means that when we soon allow Not(:x) .=> fun
any errors in this would not be caught as it has to be resolved before being passed to normalize_selection
.
In short - I agree that it is hard. I have added one answer on SO https://stackoverflow.com/questions/69194267/failing-to-execute-column-transformation-in-dataframes-jl discussing the problem that is impossible to solve but is pretty common.
Did normalize_selection
get more features in another PR or is there really nothing that can be done to get better error checking on Pairs?
No. I haven't gotten around to adding more features to normalize_selection
. I'm re-opening this PR to keep track of it.
normalize_selection
is in the process of getting new features now. Maybe to help @pdeffebach / me can you give a list of examples of incorrect transformations and the error you get now. Then I can assess if such a transformation can be handled in a better way.
An example what I mean.
If you gave:
julia> combine(DataFrame(a=1), :a => x -> 2x => :b)
1×1 DataFrame
Row │ a_function
│ Pair…
─────┼────────────
1 │ [2]=>:b
which is most likely not what the user wanted but rather combine(DataFrame(a=1), :a => (x -> 2x) => :b)
was probably wanted.
My answer would be: We cannot do anything about it. It is possible that this is what the user wanted (although unlikely). Also it is technically impossible to detect this kind of error.
If you gave:
julia> combine(DataFrame(a=1,b=2), [:a, :b] => identity)
ERROR: MethodError: no method matching identity(::Vector{Int64}, ::Vector{Int64})
which is most likely not what the user wanted but rather combine(DataFrame(a=1,b=2), [:a, :b] .=> identity)
was probably wanted.
My answer would be:
We cannot do anything about it. It is possible that identity
function could have a method taking two positional arguments. The error given is the best we can give.
The best examples I have are linked in the discourse post which prompted this issue. I am currently writing a documentation PR for the "Basic Usage of Transformation Functions" section which I hope will also help.
I have found in this thread only one example that you gave that uses src => transform => dst
minilanguage and it is:
transform!(df, r"Temp" .=> ByRow.(fahrenheit_to_celsius), renamecols = false)
The answer for this case is: it is impossible to give a better error message. By the time transform!
gets this specification it is the same as if you have written r"Temp" => ByRow(fahrenheit_to_celsius)
which is potentially a fully valid transformation specification.
As commented above - if you have any other problematic examples please let me know and I will check if they are fixable. I am asking for this because here we need to move from vague impressions (which are very important in general) to concrete things that we can evaluate if they can be fixed.
E.g. https://github.com/JuliaData/DataFrames.jl/issues/2867 was a concrete request and we have already fixed it.
iirc OP had confusion about eachcol
and similar operators.
eachcol(df) => f => :x
which should fail on normalize_selection
, I believe. But we don't expose normalize_selection
to the user so it's a difficult thing for new users to parse.
What is confusing here:
julia> select(df, eachcol(df) => sum => :x)
ERROR: ArgumentError: Unrecognized column selector: 1×1 DataFrameColumns
Row │ a
│ Int64
─────┼───────
1 │ 1
?
For me the error seems pretty explicit: you passed a column selector eachcol(df)
which is printed in the error message and this selector is invalid. How do you propose to improve the error message for this case? We potentially might list all classes of valid column selectors. Is this what you think should be changed?
One thing that is confusing is that the first argument of the pair is not what is passed to the function. In source
=> fun
, the source
is not passed, but instead whatever DataFrames looks up using source
. That lookup is invisible to the user, so it is difficult to determine if your source
looked up what your wanted and if your fun
is constructed so that it takes the inputs as the lookup spits them out. This was part of the problem with r"Temp" .=> ByRow(fahrenheit_to_celsius)
: the lookup with r"Temp"
was not creating an iterable vector like I thought so fahrenheit_to_celsius
wasn't constructed to take the inputs it should have. Similarly, you would think sum
would take eachcol(df)
no problem because sum
has a method for the output of eachcol(df)
, and that is easy to test. However, the invisible translation from column selector to columns obfuscates the type passing. I would like to see better documentation and error checking for this.
Another thing that is confusing, is how filter
and subset
are related to the transformation functions and each other. They are not mentioned in the transformation section, but they take a Pair
as input and transform the data frame just like the other transformation functions. Although, it seems that the syntax is not exactly the same. Consider the following example from the discourse post:
julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
4×3 DataFrame
Row │ node x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 0
2 │ 2 0 0
3 │ 3 9 12
4 │ 4 5 8
julia> filter!(Not(:node) => ByRow(row -> any(x -> x>0, row)), df)
ERROR: MethodError: no method matching (::ByRow{var"#21#23"})(::Int64, ::Int64)
The first example under ?filter
shows constructing an anonymous function with row
, so I followed that example to construct my anonymous function by row
. I now want to restrict the filter to only look at certain columns, so I pass those columns to my function with a column selector. But again the first confusion rears its head when Not(:node) => ByRow(f)
does not pass a DataFrameRow
to f
, so my function is constructed to accept the wrong inputs.
I am asking for this because here we need to move from vague impressions (which are very important in general) to concrete things that we can evaluate if they can be fixed.
I am trying to help suggest fixes, but honestly I am still kind of confused on how everything works. I don't know the best way to fix all the hurdles I am presenting, but I doubt that rewording the existing error messages is the full solution to simplifying understanding of the mini-language syntax.
Sidenote I did:
-
] dev DataFrames
- Created a new branch
origin/nb/docs_transformations
- Edited
docs/src/man/basics.md
- Committed my changes to
origin/nb/docs_transformations
- Clicked "Publish Changes" in VSCode.
But I got an error:
> git push -u origin origin/nb/docs_transformations
remote: Permission to JuliaData/DataFrames.jl.git denied to nathanrboyer.
fatal: unable to access 'https://github.com/JuliaData/DataFrames.jl.git/': The requested URL returned error: 403
What is the correct way to submit my changes as a PR? Clicking "New Pull Request" on GitHub didn't seem to do anything.
We potentially might list all classes of valid column selectors. Is this what you think should be changed?
That might be a good idea, or at least a link. Although, even after reading the Indexing Doc, I would not have known it is meant to list what can be used as the first part of any transformation Pair
except that your specifically told me that in a message.
I was thinking that typeof(select(df, source))
might help me construct my function
correctly in the source => function
Pair
, but the output of that command is always just DataFrame
. Maybe a new function selector_output(source)
could provide something more helpful: either type information or a working function(selector_output(source)))
for testing before trying to apply a transformation?
I would like to see better documentation and error checking for this.
Better documentation - for sure we should do it. If I understand correctly you are working on this in your PR - right?
Error checking - it is impossible unfortunately to do anything better than what we do now + additionally adding information on what is accepted.
What is the correct way to submit my changes as a PR?
You have to push changes to your clone of the repository on GitHub not to the master repository. You do not have permissions to write to it. The easiest way: just edit the file you want to change in the GitHub interface. It will automatically create your personal clone of the repository and create a PR.
Another thing that is confusing, is how
filter
andsubset
are related to the transformation functions and each other.
These functions are referenced here in the manual: https://github.com/JuliaData/DataFrames.jl/blob/main/docs/src/man/working_with_dataframes.md#subsetting-functions.
As is explained in the last paragraph of this section subset
syntax is fully consistent with select
etc.
The filter
and filter!
functions are a different thing. Unfortunately they use a different syntax as filter
has to work row-wise as opposed to all other functions. We have discussed with @nalimilan a lot what to do about it. For now we just decided not to advertise it too much. Still - probably you as a user can propose something that would be an easily understood description for you (I can help with finding the right wording). This discrepancy is exactly why subset
was introduced. If we could make filter
consistent with select
etc. we would not have added subset
.
I don't know the best way to fix all the hurdles I am presenting
It is enough for me that you report the problematic cases and we can work out how to fix them either by:
- changing behavior
- throwing a better error
- or improving documentation (as sometimes this is the only solution)
Regarding your filter!
example the answer is that filter
uses a different syntax than select
etc. unfortunately.
function(selector_output(source)))
This is not possible the reason is that the syntax is.
If source
is NOT a AsTable
what is called is (roughly):
function(eachcol(select(df, source))...)
if source is a
AsTable(source)` then what is called is (roughly):
function(Tables.columntable(select(df, source)))
The crucial thing here why it is not possible is that when you pass a non-table input the columns are splatted.
However, in short one can think of that the function gets the columns of select(df, source)
data frame in two flavors:
- if
AsTable
is used - as aNamedTuple
- if it is not used - splatted as positional arguments
Now the question to you is how do you think we should explain it to the users so that it would be easily understandable.
Thank you!
I think it's important to remembers that new users don't know how to read errors and might not understand that there are multiple possible points of failure in a call.
julia> select(df, eachcol(df) => sum => :x) ERROR: ArgumentError: Unrecognized column selector: 1×1 DataFrameColumns Row │ a │ Int64 ─────┼─────── 1 │ 1
This is a good error message, but at the end of the day it's still red text users don't read.
We could give them a single function that lets them pinpoint at least one point of failure, the selection process which throws an error with eachcol(df)
.
get_cols(df, :a)
get_cols(df, r"aa")
get_cols(df, AsTable(Not(:a))
We could return a Tuple
of the columns if not AsTable
and a NamedTuple
if AsTable
.
But if a new user can't read an error message in a transform
call, what help will this function be? So it might not make sense to add it.
I think this conversation definitely reveals actionable items about documentation. People need to know that src
is a vector of column selectors and that f
should not be applied to src
. If that is communicated effectively, then it should be clear why eachcol(df) => f
is not going to work. Hopefully @nathanrboyer 's PR communicates that well.
Unfortunately they use a different syntax as filter has to work row-wise as opposed to all other functions. Still - probably you as a user can propose something that would be an easily understood description for you (I can help with finding the right wording). This discrepancy is exactly why subset was introduced. If we could make filter consistent with select etc. we would not have added subset.
Both filter
and subset
are probably useful to have. Your filter!
solution to the problem below is cleaner than what I came up with using subset!
since the problem requires a row-wise operation:
julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
4×3 DataFrame
Row │ node x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 0
2 │ 2 0 0
3 │ 3 9 12
4 │ 4 5 8
julia> filter(row -> !all(x -> x == 0, row[Not(:node)]), df)
3×3 DataFrame
Row │ node x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 0
2 │ 3 9 12
3 │ 4 5 8
julia> subset!(df, Not(:node) => ByRow((row...) -> any(x -> x>0, row)))
3×3 DataFrame
Row │ node x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 0
2 │ 3 9 12
3 │ 4 5 8
That is why we keep filter
😄. The only downside is that it is inconsistent (and cannot be consistent given the contract for filter
in Julia Base) with subset
, select
, etc.