
create agents from file, list or data.frame


Hi Rich,

This is the follow-up issue to our Twitter conversation: https://twitter.com/riannone/status/1281491772028452864. I would like to divide the possibilities I've thought of into two separate approaches.

rule-focused approach

First, in case you haven't seen it yet, this is the vignette which describes how importing and exporting validation rules works in validate. I have used it and it works pretty well. It is especially helpful that you can import validation rules from a variety of sources, like YAML, but also data.frames.

I would describe this as a rule-focused approach, since each item in the YAML represents a single validation rule.

variable-focused approach

What I'm working on right now is more of a variable-focused approach. By that I mean that we describe our data using a metadata.yaml file. This file contains some general information about the dataset in question and a columns: section, which looks like this:

columns:
  VAR1:
    label: Short label
    notes: Some longer notes
    constraints:
    - type: integer
    - greater-than: 0
    - not-null: yes
    ....

  VAR2:
    ...

Here, each item in constraints needs to correspond to a pointblank validation function. (Since pointblank does not yet cover all of our needs, we bridge the gap with custom functions that translate our needs to col_vals_expr().)
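
For illustration, a minimal sketch of such a bridge, using a made-up "even number" constraint (col_vals_expr() accepts a one-sided formula that is evaluated against the table):

library(pointblank)

# Toy table: VAR1 must hold even integers, a constraint without a
# dedicated pointblank validation function
tbl <- data.frame(VAR1 = c(2L, 4L, 6L))

agent <-
  create_agent(tbl = tbl) %>%
  # Bridge the gap via col_vals_expr(), passing the rule as a
  # one-sided formula
  col_vals_expr(expr = ~ VAR1 %% 2 == 0) %>%
  interrogate()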

This is also where my question on Twitter was coming from: when I read that YAML into a list, it is easy to create an agent for each constraints element of each column, but reducing them into a single agent is not yet possible. (I fall back to an un-R-like approach, where I loop through all list elements and add the resulting validation functions step by step to a previously created agent; a Reduce()-based alternative is sketched below.)
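
A sketch of that fold with Reduce(), where the constraint names and the apply_constraint() helper are assumptions mirroring the YAML above rather than pointblank API:

library(pointblank)

# Constraints for one column, as parsed from metadata.yaml
constraints <- list(
  list("greater-than" = 0),
  list("not-null" = TRUE)
)

# Hypothetical translation of one constraint into a validation step;
# columns are given as character names here
apply_constraint <- function(agent, constraint, column) {
  switch(
    names(constraint)[[1]],
    "greater-than" = col_vals_gt(agent, columns = column, value = constraint[[1]]),
    "not-null"     = col_vals_not_null(agent, columns = column),
    agent  # unknown constraint type: leave the agent unchanged
  )
}

# Fold all constraints onto one agent instead of looping imperatively
agent <- Reduce(
  function(a, con) apply_constraint(a, con, "VAR1"),
  constraints,
  init = create_agent(tbl = data.frame(VAR1 = c(1L, 2L, 3L)))
)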

The advantage of this variable-focused approach (imho) is that it is easier for people to see which constraints apply to a specific variable and whether anything is missing. Also, this might make it possible for the YAML file to carry additional information about the column which might be of interest at some other point in the pipeline (this is the case in our scenario).

On the other hand, in this approach it is not straightforward to describe rules which apply to the whole dataset, like nrow(data) > 1000 (though see the workaround sketched below). Also, it may result in more typing, since functions like cols_not_null() are likely to apply to several columns, which would be easier/shorter to describe in a rule-focused approach (see the section on "Groups" in ?validate::syntax).
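
One possible workaround for such table-level rules, assuming a step's preconditions argument can reshape the table first: collapse the table to its row count with dplyr::count() and validate the resulting n column.

library(pointblank)

# Table-level rule nrow(data) > 1000, rewritten as a column check on
# the one-row count table produced by dplyr::count()
agent <-
  create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = vars(n),
    value = 1000,
    preconditions = ~ . %>% dplyr::count()
  ) %>%
  interrogate()

(With the 13-row small_table this step would of course fail; it only illustrates the shape of the workaround.)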

general thoughts

As I've mentioned above, it would be very helpful if validation rules could be read from different sources: YAML, but also R objects like data.frames or lists. This would make it possible to "translate" other formats for storing validation rules into a format which pointblank understands, entirely within R.
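
As a sketch of the data.frame case (the rules_df layout and the dispatch logic are assumptions, not an existing pointblank interface):

library(pointblank)

# Hypothetical tabular rule format: one validation rule per row
rules_df <- data.frame(
  fn     = c("col_vals_gt", "col_vals_not_null"),
  column = c("a", "a"),
  value  = c(0, NA),
  stringsAsFactors = FALSE
)

agent <- create_agent(tbl = small_table)

# Translate each row into the corresponding pointblank validation step
for (i in seq_len(nrow(rules_df))) {
  rule <- rules_df[i, ]
  agent <- switch(
    rule$fn,
    col_vals_gt       = col_vals_gt(agent, columns = rule$column, value = rule$value),
    col_vals_not_null = col_vals_not_null(agent, columns = rule$column),
    agent
  )
}

agent <- interrogate(agent)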

Please ask for clarification if anything here is unclear!

matthiasgomolka avatar Jul 10 '20 12:07 matthiasgomolka

There's now some infrastructure for writing/reading YAML (with the agent_yaml_write() and agent_yaml_read() functions). So far, the YAML is 1:1 with the API through the steps YAML key. Here's an example:

name: example
read_fn: ~small_table
actions:
  warn_fraction: 0.1
  stop_fraction: 0.25
  notify_fraction: 0.35
steps:
- col_exists:
    columns: vars(date)
- col_exists:
    columns: vars(date_time)
- col_vals_regex:
    columns: vars(b)
    regex: '[0-9]-[a-z]{3}-[0-9]{3}'
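
In R, building and round-tripping that agent looks something like this (a sketch; the create_agent() and agent_yaml_write()/agent_yaml_read() argument names follow the YAML keys above and may differ between pointblank versions):

library(pointblank)

# Reconstruct the agent that corresponds to the YAML above
agent <-
  create_agent(
    read_fn = ~ small_table,
    name = "example",
    actions = action_levels(warn_at = 0.1, stop_at = 0.25, notify_at = 0.35)
  ) %>%
  col_exists(columns = vars(date)) %>%
  col_exists(columns = vars(date_time)) %>%
  col_vals_regex(columns = vars(b), regex = "[0-9]-[a-z]{3}-[0-9]{3}")

# Write the agent to YAML and read it back in
agent_yaml_write(agent, filename = "example.yml")
agent_2 <- agent_yaml_read("example.yml")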

What you're proposing makes YAML in pointblank much more useful for a lot of applications. I like the idea of a label and notes for describing the columns. Aside from this, we are currently missing a label field for the validation steps (there is briefs, which could serve as notes for a validation step, but there isn't yet a label for a validation step).

I'll add both of these pieces of work as separate issues. I think that a single YAML file could serve double duty as a data dictionary (a way to describe the columns, the dataset itself, the meaning of rows, etc.) as well as a place to store validation directives (both column-focused and function-focused).

This work should be done ahead of adding the functionality for using the constraints list. I'm pretty excited about all of this and thank you for your patience so far.

As for data frames and lists for holding validation rules, that's definitely not out of the realm of possibility, but that work should probably be done last. For the df case, I can imagine it being quite useful so long as the form of the table makes sense to the user.

rich-iannone avatar Aug 24 '20 00:08 rich-iannone

Thanks for your efforts so far! I still need to look into the new feature from #149 in more detail. But I completely share your view on the double duty of the YAML file. This is exactly what we are experimenting with right now.

matthiasgomolka avatar Sep 14 '20 07:09 matthiasgomolka

Great! Development is pretty focused on the combination of table metadata and table validation. It’s such a huge change that it’ll probably go through a few iterations to make it feel right. Feel free to open issues or comment on existing issues with any feedback as these features are developed.

rich-iannone avatar Sep 14 '20 07:09 rich-iannone