
Compare header and schema

Open peterdesmet opened this issue 1 year ago • 2 comments

Update: this can now be defined in fieldMatch #216


It is possible for an (invalid) Data Package to have discrepancies between the schema and the actual data, e.g. the schema defines more or fewer columns than the file, or lists them in a different order. read_resource() will silently let those through when the data types of the switched columns are compatible, which can lead to issues for the user (e.g. lat/lon are silently switched). Only when the data types are incompatible will readr return a parsing issue.
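A minimal sketch of the problem, using base R's read.csv() as a stand-in for readr::read_delim(). It assumes that the header row is skipped and column names are assigned by position from the schema; the file content and field names are made up for illustration.

```r
# File whose header says lon comes first, then lat.
csv <- c("lon,lat", "4.35,50.85")

# Schema that (incorrectly) declares lat first, then lon.
field_names <- c("lat", "lon")

# Skip the header row and name the columns from the schema instead.
# Assumption: this mimics how column names get assigned; it is not the
# package's actual code.
df <- read.csv(
  text = csv, header = FALSE, skip = 1,
  col.names = field_names, colClasses = "numeric"
)

# Both columns parse as numbers, so nothing is reported,
# but df$lat now holds the longitude value from the file.
df$lat
```

Because both columns are numeric, no warning or parsing issue is raised and the values end up under the wrong names.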

To avoid passing these issues silently, read_resource() should compare the headers of the file with the schema and raise an error if those are not exactly the same. This implements the following spec:

The field descriptor MUST contain a name property. This property SHOULD correspond to the name of field/column in the data file (if it has a name). As such it SHOULD be unique (though it is possible, but very bad practice, for the data file to have multiple columns with the same name). name SHOULD NOT be considered case sensitive in determining uniqueness. However, since it should correspond to the name of the field in the data file it may be important to preserve case.

Implementation considerations:

  • [ ] Only compare when replace_null(dialect$header, TRUE) (i.e. it is not false). It might be useful to define dialect_header and reuse it here: https://github.com/frictionlessdata/frictionless-r/blob/421c22f8be948006c5fb89a822124fbb803dff12/R/read_resource.R#L356

  • [ ] The specs say that case should NOT be considered, so both the field names and col_names should be lowercased before comparing

  • [ ] To allow comparison, the header line of the file should be read separately from the main read_delim(). read_lines() could be used, but delim and encoding/locale might have to be passed too.

  • [ ] A resource can contain multiple files (e.g. observations_1, observations_2). Either all files are read and compared or only the last one, cf. add_resource():

    The last file will be read with readr::read_delim() to create or compare with schema and to set format, mediatype and encoding. The other files are ignored, but are expected to have the same structure and properties.

  • [ ] On a mismatch (different field names, different order, more or fewer columns), an error should be returned, similar to check_schema(): https://github.com/frictionlessdata/frictionless-r/blob/421c22f8be948006c5fb89a822124fbb803dff12/R/check_schema.R#L65-L69

  • [ ] Add a Validation section to explain what we validate:

    #' @section Validation:
    #' Full validation is not supported.
    #' Something about validation issues
    #' Something about header compare
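
The considerations above could be sketched roughly as follows. This is only an illustration in base R: check_header() and its arguments are hypothetical names, readLines() and scan() stand in for the readr::read_lines() approach mentioned above, and the error message is made up rather than taken from check_schema().

```r
# Hypothetical helper, not part of frictionless-r: compares the header of a
# file with the field names declared in a Table Schema.
check_header <- function(file, schema, delim = ",", header = TRUE,
                         encoding = "UTF-8") {
  # Nothing to compare when the dialect says the file has no header row.
  if (!isTRUE(header)) {
    return(invisible(TRUE))
  }

  # Read only the first line, then split it on the delimiter.
  # scan() respects quoted values, unlike a plain strsplit().
  first_line <- readLines(file, n = 1, encoding = encoding, warn = FALSE)
  col_names <- scan(
    text = first_line, what = character(), sep = delim,
    quote = "\"", quiet = TRUE
  )

  # Field names as declared in the schema, in order.
  field_names <- vapply(schema$fields, function(x) x$name, character(1))

  # The spec says case should not be considered, so lowercase both sides.
  if (!identical(tolower(col_names), tolower(field_names))) {
    stop(
      "Header of the file does not match the field names in `schema`.\n",
      "Field names: ", paste(field_names, collapse = ", "), "\n",
      "Column names: ", paste(col_names, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Example: the schema declares lat, lon but the file has lon, lat.
schema <- list(fields = list(
  list(name = "lat", type = "number"),
  list(name = "lon", type = "number")
))
swapped_file <- tempfile(fileext = ".csv")
writeLines(c("lon,lat", "4.35,50.85"), swapped_file)
result <- tryCatch(
  check_header(swapped_file, schema),
  error = function(e) conditionMessage(e)
)
```

For a multipart resource, the same helper could be called on the last file only or on every file, depending on which option is chosen.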
    

peterdesmet avatar Mar 21 '23 14:03 peterdesmet

Some questions:

Multipart resources

  • What about multipart resources, should all parts of the resource be checked? Or just the first/last one?
  • For multipart resources, will the parts always either all have a header, or none of them? Or is it possible that, for example, only the first part has a header?

Naming

What would be a good argument name to toggle this comparison/check?

  • check_header = TRUE
  • compare_header = TRUE
  • check_fields = TRUE

Default behavior

I assume that read_resource() should by default not compare the header and the schema?

PietrH avatar Mar 22 '23 11:03 PietrH

  • Multipart resources: to increase performance (especially when reading from a URL) I'd be fine with only the last file being read.
  • A header or not is defined at resource level, meaning all files should comply.
  • I would not add a parameter in read_resource, but always include this check. It is a recommended part of the specs: This property SHOULD correspond ...

peterdesmet avatar Mar 22 '23 11:03 peterdesmet