validate icon indicating copy to clipboard operation
validate copied to clipboard

Could methods for lists and matrices be added?

Open dpritchard opened this issue 6 years ago • 8 comments

Apologies for the second open issue this week. Obviously I am pretty excited about validate!

From what I can see, it is only possible to confront things of class data.frame. Is that correct?

I know data frames are the most common data structure (is there anything else in the tidyverse?!), but I still have use for lists and matrices.

I am only just beginning to dig into the internals, but it looks like it would be possible to support data structures other than data frames. Actually, the following addition to confrontation.R 'works' (and tests pass) in a local build:

#' @rdname confront
setMethod("confront", signature("list", "validator"), function(dat, x, key=NA_character_,...){
    data_env <- list2env(dat)
    data_env$. <- dat
    confront_work(x,data_env,key,'validation',...)
})

#' @rdname confront
setMethod("confront", signature("matrix", "validator"), function(dat, x, ...){
    data_env <- new.env()
    data_env$. <- dat
    confront_work(x,data_env,class = 'validation',...) # Assuming 'key' has no meaning here.
})

Is there any reason not to extend in this way?

I would be happy to offer a PR with supporting tests, if that is desired.

dpritchard avatar Nov 10 '18 05:11 dpritchard

Hi there, thanks for the support!

Yes, extendability to other data types (and rule representations) is a deliberate choice in validate. It is the main reason why were needed S4 since there is no other OO system in R offering multiple dispatch. One thing on the wish list on a to support data in databases.

I'm currently traveling on holiday but we will definitely look into this when I'm back. Also, 2019 we'll have some resources to work more on validate.

markvanderloo avatar Nov 18 '18 10:11 markvanderloo

FYI, I have implemented some basic handling for lists and "everything else" signature(ANY, ...) over here:

https://github.com/dpritchard/validate/commit/912b87afff8f61131f58a53e0b5a6bf3734e4875

But I don't think this is going to be so simple if reference data is going to be included. The logic in match_rows() doesn't generalise well for non-dataframe objects.

Oddly enough, as at the above commit, you can confront non-dataframe data and include reference data, e.g.:

l <- list("A" = 1:6, "B" = "test")
r <- list(A = 1:6)
v <- validator(A == ref$A)
confront(dat = l, x = v, ref = r)

and this doesn't throw an error for a missing method / signature. Instead, it dispatches on the signature without a reference defined... I am guessing that it is captured in ...somehow, but I clearly don't really understand S4!

dpritchard avatar Dec 15 '18 06:12 dpritchard

OK, so S4 dispatch is not too hard! This works as (I) expect:

https://github.com/dpritchard/validate/commit/4bc15df58ba749faac80ac64bdfca81d4124411a

I think that existing (data.frame, based) method signatures should also be updated in a similar way... It depends what the design intention is, but I'd guess that if you provide a nonsense reference type (currently something that is not an environment, a data.frame or a list) then method dispatch should fail. For example, this dispatches (but errors out, obviously):

confront(dat = data.frame(A=1), x = validator(A==ref$A), ref = complex(1L))

I can open another issue and provide a PR. But no rush, just let me know.

dpritchard avatar Dec 15 '18 23:12 dpritchard

OK, so S4 dispatch is not too hard!

Well, maybe it's harder than I thought! Introducing a method for a list creates a method dispatch ambiguity (for data.frames), obviously.

To work around that I have just dropped the data.frame method: https://github.com/dpritchard/validate/commit/ec1c348e31e4fcda6f0ad07e6d062a38f075f0f7

Sorry for all the noise, but I will forget this if I don't write it down somewhere!

dpritchard avatar Dec 16 '18 01:12 dpritchard

No problem. These things take some careful consideration. Especially as lists can contain anything you can think of. Having some kind of definition of data structure is important prior to defining validation rules. Currently, validate presumes a dataframe-like structure. You can see that for example because a validating statement like

if (x >0) y>0

will be vectorized. It is to me an open question how this should be handled for lists, which are pretty unstructured, apart from the fact that their content has a natural order (compare this to environment where there is no order). Of course vectorization is implemented so that it can be turned on and off and so on, but all I'm saying is: there are some choices to be made. (that's why data validation is an interesting topic ;-) )

I am actually surprised that the ref=complex(1L) dispatches. That's an issue.

markvanderloo avatar Dec 17 '18 14:12 markvanderloo

Thanks for the reply.

I think it best to leave those design choices to you and the other package authors. For now the changes I have made to my local fork allow me to experiment as needed... OK for me, because I know the limitations, but as you say, it is hard to know how to handle these things in a general way.

The fix for the ref dispatch problem is a minor one. As I have learnt recently, an S4 signature needs to follow the generic... So the trick is to be explicit by including all three of the generic arguments, naming them and defining ref = missing (assuming you are expecting it be missing, which I think you are).

I have to confess that it is getting into the height of the holiday / barbecue season here in the Southern Hemisphere, so time for coding projects will be limited! But once I am back to my desk in the New Year and perhaps after Travis is fixed then I can send a PR.

Thanks again to you and others for providing these tools.

dpritchard avatar Dec 17 '18 20:12 dpritchard

Would a pull request that makes the following changes be useful?

  1. Prevent dispatch on unsupported reference types and/or
  2. Allows lists and other types for the simple (non-reference) case

I have done the background coding and supporting tests, but I haven't looked at the documentation (such a change would need documenting, I'd guess?).

I have been using my local branch for a few weeks now without too much grief and I am getting very used to validating lists and matrixes!

dpritchard avatar Jan 18 '19 21:01 dpritchard

Yes, we would certainly like that! Sorry once again for the very late reply!

markvanderloo avatar Apr 10 '19 11:04 markvanderloo