validate
validate copied to clipboard
Could methods for lists and matrices be added?
Apologies for the second open issue this week. Obviously I am pretty excited about validate
!
From what I can see, it is only possible to confront
things of class data.frame
. Is that correct?
I know data frames are the most common data structure (is there anything else in the tidyverse?!), but I still have use for lists and matrices.
I am only just beginning to dig into the internals, but it looks like it would be possible to support data structures other than data frames. Actually, the following addition to confrontation.R 'works' (and tests pass) in a local build:
#' @rdname confront
setMethod("confront", signature("list", "validator"), function(dat, x, key=NA_character_,...){
data_env <- list2env(dat)
data_env$. <- dat
confront_work(x,data_env,key,'validation',...)
})
#' @rdname confront
setMethod("confront", signature("matrix", "validator"), function(dat, x, ...){
data_env <- new.env()
data_env$. <- dat
confront_work(x,data_env,class = 'validation',...) # Assuming 'key' has no meaning here.
})
Is there any reason not to extend in this way?
I would be happy to offer a PR with supporting tests, if that is desired.
Hi there, thanks for the support!
Yes, extendability to other data types (and rule representations) is a deliberate choice in validate. It is the main reason why were needed S4 since there is no other OO system in R offering multiple dispatch. One thing on the wish list on a to support data in databases.
I'm currently traveling on holiday but we will definitely look into this when I'm back. Also, 2019 we'll have some resources to work more on validate.
FYI, I have implemented some basic handling for lists and "everything else" signature(ANY, ...)
over here:
https://github.com/dpritchard/validate/commit/912b87afff8f61131f58a53e0b5a6bf3734e4875
But I don't think this is going to be so simple if reference data is going to be included. The logic in match_rows()
doesn't generalise well for non-dataframe objects.
Oddly enough, as at the above commit, you can confront non-dataframe data and include reference data, e.g.:
l <- list("A" = 1:6, "B" = "test")
r <- list(A = 1:6)
v <- validator(A == ref$A)
confront(dat = l, x = v, ref = r)
and this doesn't throw an error for a missing method / signature. Instead, it dispatches on the signature without a reference defined... I am guessing that it is captured in ...
somehow, but I clearly don't really understand S4!
OK, so S4 dispatch is not too hard! This works as (I) expect:
https://github.com/dpritchard/validate/commit/4bc15df58ba749faac80ac64bdfca81d4124411a
I think that existing (data.frame
, based) method signatures should also be updated in a similar way... It depends what the design intention is, but I'd guess that if you provide a nonsense reference type (currently something that is not an environment
, a data.frame
or a list
) then method dispatch should fail. For example, this dispatches (but errors out, obviously):
confront(dat = data.frame(A=1), x = validator(A==ref$A), ref = complex(1L))
I can open another issue and provide a PR. But no rush, just let me know.
OK, so S4 dispatch is not too hard!
Well, maybe it's harder than I thought! Introducing a method for a list
creates a method dispatch ambiguity (for data.frames
), obviously.
To work around that I have just dropped the data.frame
method: https://github.com/dpritchard/validate/commit/ec1c348e31e4fcda6f0ad07e6d062a38f075f0f7
Sorry for all the noise, but I will forget this if I don't write it down somewhere!
No problem. These things take some careful consideration. Especially as lists can contain anything you can think of. Having some kind of definition of data structure is important prior to defining validation rules. Currently, validate
presumes a dataframe-like structure. You can see that for example because a validating statement like
if (x >0) y>0
will be vectorized. It is to me an open question how this should be handled for lists, which are pretty unstructured, apart from the fact that their content has a natural order (compare this to environment
where there is no order). Of course vectorization is implemented so that it can be turned on and off and so on, but all I'm saying is: there are some choices to be made. (that's why data validation is an interesting topic ;-) )
I am actually surprised that the ref=complex(1L)
dispatches. That's an issue.
Thanks for the reply.
I think it best to leave those design choices to you and the other package authors. For now the changes I have made to my local fork allow me to experiment as needed... OK for me, because I know the limitations, but as you say, it is hard to know how to handle these things in a general way.
The fix for the ref
dispatch problem is a minor one. As I have learnt recently, an S4 signature needs to follow the generic... So the trick is to be explicit by including all three of the generic arguments, naming them and defining ref = missing
(assuming you are expecting it be missing, which I think you are).
I have to confess that it is getting into the height of the holiday / barbecue season here in the Southern Hemisphere, so time for coding projects will be limited! But once I am back to my desk in the New Year and perhaps after Travis is fixed then I can send a PR.
Thanks again to you and others for providing these tools.
Would a pull request that makes the following changes be useful?
- Prevent dispatch on unsupported reference types and/or
- Allows lists and other types for the simple (non-reference) case
I have done the background coding and supporting tests, but I haven't looked at the documentation (such a change would need documenting, I'd guess?).
I have been using my local branch for a few weeks now without too much grief and I am getting very used to validating lists and matrixes!
Yes, we would certainly like that! Sorry once again for the very late reply!