DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Add a validator for src => transform => dst

Open bkamins opened this issue 2 years ago • 24 comments

In order to improve readability of error messages

bkamins avatar Sep 08 '21 19:09 bkamins

@pdeffebach - can you give me please an example when we produce something that is not clear? I have checked and for incorrect selectors we produce:

julia> select(df, 'a')
ERROR: ArgumentError: Unrecognized column selector: a

bkamins avatar Sep 09 '21 13:09 bkamins

I think the best suggestion for this was a function to help construct pairs.

make_pair(source = [:a, :b], fun = f, dest = AsTable)`
transform(df, make_pair(...))

There should also be functionality added to make typeof(source), typeof(f(source)), and typeof(dest) work.

This might also help with #2870 if the function syntax would make it easier to programatically create a list of args... to pass to transform and the like.

nathanrboyer avatar Sep 10 '21 14:09 nathanrboyer

Could you please explain what is the advantage of:

make_pair(source = [:a, :b], fun = f, dest = AsTable)

over

[:a, :b] => f => AsTable

?

For me it seems more verbose and less readable.

There should also be functionality added to make typeof(source), typeof(f(source)), and typeof(dest) work.

I am not fully clear what you mean here? Could you please provide an example of assumed input and expected output?

Thank you!

bkamins avatar Sep 10 '21 15:09 bkamins

It is only advantageous if the new function (or package) can provide additional error checking, type checking, and validation. Ideally, it could tell you what part of your pair is invalid or needs broadcasted.

nathanrboyer avatar Sep 10 '21 16:09 nathanrboyer

Ideally, it could tell you what part of your pair is invalid or needs broadcasted.

OK - now I understand.

The problem is that it is extremely hard (if not impossible). The problem is that Julia is very flexible what you can define to work (e.g. consider https://docs.julialang.org/en/v1/manual/methods/#Function-like-objects-1 which are objects, but are callable - you cannot detect such things by checking types unfortunately)

But I will think what can be done.

bkamins avatar Sep 10 '21 16:09 bkamins

I'll call on @pdeffebach and @CameronBieganek to add their thoughts too.

nathanrboyer avatar Sep 10 '21 16:09 nathanrboyer

I'm afraid I don't know enough to be helpful with the implementation details. Could you start by supporting more basic types and just print a warning "Validation not yet supported" for functors and such? I imagine advanced users would construct the pairs themselves anyway.

nathanrboyer avatar Sep 10 '21 16:09 nathanrboyer

I think another problem is that because the rules are so complicated, having a validator be separate from the place where the error is would lead to code duplication and, at worst, divergent behavior.

One thing we could also do is export a prettier version of normalize_selection, which is where a lot of the pre-processing happens. This would, in effect, be a validator. I will check if this is feasible.

pdeffebach avatar Sep 15 '21 01:09 pdeffebach

I will check if this is feasible.

We would just need to allow passing a data frame, or grouped data frame instead of an index. This is fully feasible. Plus we need to warn users that normalize_selection works on a single transform (because .=> stuff happens before). This also means that when we soon allow Not(:x) .=> fun any errors in this would not be caught as it has to be resolved before being passed to normalize_selection.

In short - I agree that it is hard. I have added one answer on SO https://stackoverflow.com/questions/69194267/failing-to-execute-column-transformation-in-dataframes-jl discussing the problem that is impossible to solve but is pretty common.

bkamins avatar Sep 15 '21 14:09 bkamins

Did normalize_selection get more features in another PR or is there really nothing that can be done to get better error checking on Pairs?

nathanrboyer avatar Oct 11 '21 15:10 nathanrboyer

No. I haven't gotten around to adding more features to normalize_selection. I'm re-opening this PR to keep track of it.

pdeffebach avatar Oct 11 '21 16:10 pdeffebach

normalize_selection is in the process of getting new features now. Maybe to help @pdeffebach / me can you give a list of examples of incorrect transformations and the error you get now. Then I can assess if such a transformation can be handled in a better way.

An example what I mean.

If you gave:

julia> combine(DataFrame(a=1), :a => x -> 2x => :b)
1×1 DataFrame
 Row │ a_function
     │ Pair…
─────┼────────────
   1 │ [2]=>:b

which is most likely not what the user wanted but rather combine(DataFrame(a=1), :a => (x -> 2x) => :b) was probably wanted.

My answer would be: We cannot do anything about it. It is possible that this is what the user wanted (although unlikely). Also it is technically impossible to detect this kind of error.

If you gave:

julia> combine(DataFrame(a=1,b=2), [:a, :b] => identity)
ERROR: MethodError: no method matching identity(::Vector{Int64}, ::Vector{Int64})

which is most likely not what the user wanted but rather combine(DataFrame(a=1,b=2), [:a, :b] .=> identity) was probably wanted.

My answer would be: We cannot do anything about it. It is possible that identity function could have a method taking two positional arguments. The error given is the best we can give.

bkamins avatar Oct 11 '21 18:10 bkamins

The best examples I have are linked in the discourse post which prompted this issue. I am currently writing a documentation PR for the "Basic Usage of Transformation Functions" section which I hope will also help.

nathanrboyer avatar Oct 11 '21 19:10 nathanrboyer

I have found in this thread only one example that you gave that uses src => transform => dst minilanguage and it is:

transform!(df, r"Temp" .=> ByRow.(fahrenheit_to_celsius), renamecols = false)

The answer for this case is: it is impossible to give a better error message. By the time transform! gets this specification it is the same as if you have written r"Temp" => ByRow(fahrenheit_to_celsius) which is potentially a fully valid transformation specification.


As commented above - if you have any other problematic examples please let me know and I will check if they are fixable. I am asking for this because here we need to move from vague impressions (which are very important in general) to concrete things that we can evaluate if they can be fixed.

E.g. https://github.com/JuliaData/DataFrames.jl/issues/2867 was a concrete request and we have already fixed it.

bkamins avatar Oct 11 '21 20:10 bkamins

iirc OP had confusion about eachcol and similar operators.

eachcol(df) => f => :x

which should fail on normalize_selection, I believe. But we don't expose normalize_selection to the user so it's a difficult thing for new users to parse.

pdeffebach avatar Oct 11 '21 21:10 pdeffebach

What is confusing here:

julia> select(df, eachcol(df) => sum => :x)
ERROR: ArgumentError: Unrecognized column selector: 1×1 DataFrameColumns
 Row │ a
     │ Int64
─────┼───────
   1 │     1

?

For me the error seems pretty explicit: you passed a column selector eachcol(df) which is printed in the error message and this selector is invalid. How do you propose to improve the error message for this case? We potentially might list all classes of valid column selectors. Is this what you think should be changed?

bkamins avatar Oct 12 '21 09:10 bkamins

One thing that is confusing is that the first argument of the pair is not what is passed to the function. In source => fun, the source is not passed, but instead whatever DataFrames looks up using source. That lookup is invisible to the user, so it is difficult to determine if your source looked up what your wanted and if your fun is constructed so that it takes the inputs as the lookup spits them out. This was part of the problem with r"Temp" .=> ByRow(fahrenheit_to_celsius): the lookup with r"Temp" was not creating an iterable vector like I thought so fahrenheit_to_celsius wasn't constructed to take the inputs it should have. Similarly, you would think sum would take eachcol(df) no problem because sum has a method for the output of eachcol(df), and that is easy to test. However, the invisible translation from column selector to columns obfuscates the type passing. I would like to see better documentation and error checking for this.

Another thing that is confusing, is how filter and subset are related to the transformation functions and each other. They are not mentioned in the transformation section, but they take a Pair as input and transform the data frame just like the other transformation functions. Although, it seems that the syntax is not exactly the same. Consider the following example from the discourse post:

julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
4×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      0
   2 │     2      0      0
   3 │     3      9     12
   4 │     4      5      8 

julia> filter!(Not(:node) => ByRow(row -> any(x -> x>0, row)), df)
ERROR: MethodError: no method matching (::ByRow{var"#21#23"})(::Int64, ::Int64)

The first example under ?filter shows constructing an anonymous function with row, so I followed that example to construct my anonymous function by row. I now want to restrict the filter to only look at certain columns, so I pass those columns to my function with a column selector. But again the first confusion rears its head when Not(:node) => ByRow(f) does not pass a DataFrameRow to f, so my function is constructed to accept the wrong inputs.

I am asking for this because here we need to move from vague impressions (which are very important in general) to concrete things that we can evaluate if they can be fixed.

I am trying to help suggest fixes, but honestly I am still kind of confused on how everything works. I don't know the best way to fix all the hurdles I am presenting, but I doubt that rewording the existing error messages is the full solution to simplifying understanding of the mini-language syntax.

nathanrboyer avatar Oct 12 '21 15:10 nathanrboyer

Sidenote I did:

  1. ] dev DataFrames
  2. Created a new branch origin/nb/docs_transformations
  3. Edited docs/src/man/basics.md
  4. Committed my changes to origin/nb/docs_transformations
  5. Clicked "Publish Changes" in VSCode.

But I got an error:

> git push -u origin origin/nb/docs_transformations
remote: Permission to JuliaData/DataFrames.jl.git denied to nathanrboyer.
fatal: unable to access 'https://github.com/JuliaData/DataFrames.jl.git/': The requested URL returned error: 403

What is the correct way to submit my changes as a PR? Clicking "New Pull Request" on GitHub didn't seem to do anything.

nathanrboyer avatar Oct 12 '21 15:10 nathanrboyer

We potentially might list all classes of valid column selectors. Is this what you think should be changed?

That might be a good idea, or at least a link. Although, even after reading the Indexing Doc, I would not have known it is meant to list what can be used as the first part of any transformation Pair except that your specifically told me that in a message.

nathanrboyer avatar Oct 12 '21 15:10 nathanrboyer

I was thinking that typeof(select(df, source)) might help me construct my function correctly in the source => function Pair, but the output of that command is always just DataFrame. Maybe a new function selector_output(source) could provide something more helpful: either type information or a working function(selector_output(source))) for testing before trying to apply a transformation?

nathanrboyer avatar Oct 12 '21 16:10 nathanrboyer

I would like to see better documentation and error checking for this.

Better documentation - for sure we should do it. If I understand correctly you are working on this in your PR - right?

Error checking - it is impossible unfortunately to do anything better than what we do now + additionally adding information on what is accepted.

What is the correct way to submit my changes as a PR?

You have to push changes to your clone of the repository on GitHub not to the master repository. You do not have permissions to write to it. The easiest way: just edit the file you want to change in the GitHub interface. It will automatically create your personal clone of the repository and create a PR.

Another thing that is confusing, is how filter and subset are related to the transformation functions and each other.

These functions are referenced here in the manual: https://github.com/JuliaData/DataFrames.jl/blob/main/docs/src/man/working_with_dataframes.md#subsetting-functions.

As is explained in the last paragraph of this section subset syntax is fully consistent with select etc.

The filter and filter! functions are a different thing. Unfortunately they use a different syntax as filter has to work row-wise as opposed to all other functions. We have discussed with @nalimilan a lot what to do about it. For now we just decided not to advertise it too much. Still - probably you as a user can propose something that would be an easily understood description for you (I can help with finding the right wording). This discrepancy is exactly why subset was introduced. If we could make filter consistent with select etc. we would not have added subset.

I don't know the best way to fix all the hurdles I am presenting

It is enough for me that you report the problematic cases and we can work out how to fix them either by:

  • changing behavior
  • throwing a better error
  • or improving documentation (as sometimes this is the only solution)

Regarding your filter! example the answer is that filter uses a different syntax than select etc. unfortunately.

function(selector_output(source)))

This is not possible the reason is that the syntax is.

If source is NOT a AsTable what is called is (roughly):

function(eachcol(select(df, source))...)

if source is a AsTable(source)` then what is called is (roughly):

function(Tables.columntable(select(df, source)))

The crucial thing here why it is not possible is that when you pass a non-table input the columns are splatted.


However, in short one can think of that the function gets the columns of select(df, source) data frame in two flavors:

  • if AsTable is used - as a NamedTuple
  • if it is not used - splatted as positional arguments

Now the question to you is how do you think we should explain it to the users so that it would be easily understandable.

Thank you!

bkamins avatar Oct 12 '21 17:10 bkamins

I think it's important to remembers that new users don't know how to read errors and might not understand that there are multiple possible points of failure in a call.

julia> select(df, eachcol(df) => sum => :x)
ERROR: ArgumentError: Unrecognized column selector: 1×1 DataFrameColumns
 Row │ a
     │ Int64
─────┼───────
   1 │     1

This is a good error message, but at the end of the day it's still red text users don't read.

We could give them a single function that lets them pinpoint at least one point of failure, the selection process which throws an error with eachcol(df).

get_cols(df, :a)
get_cols(df, r"aa")
get_cols(df, AsTable(Not(:a))

We could return a Tuple of the columns if not AsTable and a NamedTuple if AsTable.

But if a new user can't read an error message in a transform call, what help will this function be? So it might not make sense to add it.

I think this conversation definitely reveals actionable items about documentation. People need to know that src is a vector of column selectors and that f should not be applied to src. If that is communicated effectively, then it should be clear why eachcol(df) => f is not going to work. Hopefully @nathanrboyer 's PR communicates that well.

pdeffebach avatar Oct 12 '21 18:10 pdeffebach

Unfortunately they use a different syntax as filter has to work row-wise as opposed to all other functions. Still - probably you as a user can propose something that would be an easily understood description for you (I can help with finding the right wording). This discrepancy is exactly why subset was introduced. If we could make filter consistent with select etc. we would not have added subset.

Both filter and subset are probably useful to have. Your filter! solution to the problem below is cleaner than what I came up with using subset! since the problem requires a row-wise operation:

julia> df = DataFrame(node=1:4, x=[1,0,9,5], y=[0,0,12,8])
4×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      0
   2 │     2      0      0
   3 │     3      9     12
   4 │     4      5      8

julia> filter(row -> !all(x -> x == 0, row[Not(:node)]), df)
3×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      0
   2 │     3      9     12
   3 │     4      5      8

julia> subset!(df, Not(:node) => ByRow((row...) -> any(x -> x>0, row)))
3×3 DataFrame
 Row │ node   x      y     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      0
   2 │     3      9     12
   3 │     4      5      8

nathanrboyer avatar Oct 12 '21 19:10 nathanrboyer

That is why we keep filter 😄. The only downside is that it is inconsistent (and cannot be consistent given the contract for filter in Julia Base) with subset, select, etc.

bkamins avatar Oct 12 '21 21:10 bkamins