frictionless-r
Compare header and schema
Update: this can now be defined in fieldMatch
#216
It is possible for an (invalid) Data Package to have discrepancies between the schema and the actual data, e.g. the schema defines more or fewer columns than the file, or defines them in a different order. `read_resource()` will silently let those through when the data types of the switched columns are compatible, which can lead to issues for the user (e.g. lat/lon are silently switched). Only when the data types are incompatible will readr return a parsing issue.
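The silent switch described above can be reproduced with a small sketch. It uses base R rather than readr, and it assumes the reader takes its column names from the schema and skips the header row of the file. The data, `csv_text` and `schema_names` are made up for illustration.

```r
# Hypothetical data file: the header says the order is lon, lat.
csv_text <- "lon,lat\n4.35,50.85\n3.72,51.05\n"

# Hypothetical schema that lists the same fields in the opposite order.
schema_names <- c("lat", "lon")

# Mimic a reader that skips the header row and names columns after the schema.
df <- read.csv(
  text = csv_text,
  header = FALSE,
  skip = 1,
  col.names = schema_names
)

# Both columns are numeric, so nothing fails, but `lat` now holds longitudes.
print(df)
```

No parsing issue is raised here because both columns parse as numbers, which is exactly the case a header comparison would catch.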
To avoid passing these issues silently, `read_resource()` should compare the header of the file with the schema and raise an error if they are not exactly the same. This implements the following spec:
The field descriptor MUST contain a `name` property. This property SHOULD correspond to the name of field/column in the data file (if it has a name). As such it SHOULD be unique (though it is possible, but very bad practice, for the data file to have multiple columns with the same name). `name` SHOULD NOT be considered case sensitive in determining uniqueness. However, since it should correspond to the name of the field in the data file it may be important to preserve case.
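As a small illustration of the uniqueness rule in the spec text above, the following base R sketch finds field names that collide once case is ignored. The `field_names` vector is a made-up example, not taken from any real package.

```r
# Hypothetical field names from a schema; "Lat" and "lat" differ only in case.
field_names <- c("id", "Lat", "lat", "lon")

# Compare on a lowercased copy, but report the names with their original case.
lowered <- tolower(field_names)
is_dup <- duplicated(lowered) | duplicated(lowered, fromLast = TRUE)
duplicated_names <- field_names[is_dup]

print(duplicated_names)
```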
Implementation considerations:
- [ ] Only compare when `replace_null(dialect$header, TRUE)` is `TRUE` (i.e. the header is not set to false). It might be useful to define `dialect_header` and reuse it here: https://github.com/frictionlessdata/frictionless-r/blob/421c22f8be948006c5fb89a822124fbb803dff12/R/read_resource.R#L356
- [ ] The specs say that case should NOT be considered, so both the field names and `col_names` should be lowercased before comparing.
- [ ] To allow comparison, the header line of the file should be read separately from the main `read_delim()`. `read_lines()` could be used, but `delim` and `encoding`/`locale` might have to be passed too.
- [ ] A resource can contain multiple files (e.g. `observations_1`, `observations_2`). Either all files are read and compared, or only the last one, cf. `add_resource()`: "The last file will be read with readr::read_delim() to create or compare with schema and to set format, mediatype and encoding. The other files are ignored, but are expected to have the same structure and properties."
- [ ] On a mismatch (different field names, different order, more or fewer fields), an error should be returned, similar to `check_schema()`: https://github.com/frictionlessdata/frictionless-r/blob/421c22f8be948006c5fb89a822124fbb803dff12/R/check_schema.R#L65-L69
- [ ] Add a Validation section to the documentation to explain what we validate:
  #' @section Validation:
  #' Full validation is not supported.
  #' Something about validation issues
  #' Something about header compare
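The considerations above can be sketched as follows. This is a minimal illustration in base R, not the frictionless-r implementation: `compare_header()` and all of its arguments are hypothetical, and `readLines()`/`scan()` stand in for the readr functions mentioned in the list.

```r
# Hypothetical helper: compare the header line of a file with schema field names.
compare_header <- function(path, field_names, delim = ",",
                           encoding = "UTF-8", header = TRUE) {
  # Only compare when the dialect says there is a header.
  if (!isTRUE(header)) {
    return(invisible(TRUE))
  }

  # Read just the first line, separately from the main read.
  con <- file(path, encoding = encoding)
  on.exit(close(con))
  first_line <- readLines(con, n = 1, warn = FALSE)

  # Split the header line on the delimiter, honouring double quotes.
  col_names <- if (length(first_line) == 0) {
    character(0)
  } else {
    scan(text = first_line, what = character(), sep = delim,
         quote = "\"", quiet = TRUE)
  }

  # Case is ignored; order and number of names must match exactly.
  if (!identical(tolower(col_names), tolower(field_names))) {
    stop(
      "Header does not match the schema.\n",
      "  File:   ", paste(col_names, collapse = ", "), "\n",
      "  Schema: ", paste(field_names, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Usage with a temporary file whose header differs from the schema only in case.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Lat,Lon", "50.85,4.35"), tmp)
compare_header(tmp, c("lat", "lon"))
```

Calling `compare_header(tmp, c("lon", "lat"))` on the same file would stop with an error, because the order differs.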
Some questions:
Multipart resources
- What about multipart resources: should all parts of the resource be checked, or just the first/last one?
- For multipart resources, will either all parts have a header or none of them? Or is it possible that, for example, only the first part has a header?
Naming
What would be a good argument name to toggle this comparison/check?
- `check_header = TRUE`
- `compare_header = TRUE`
- `check_fields = TRUE`
Default behavior
I assume that `read_resource()` should by default not compare the header and the schema?
- Multipart resources: to increase performance (especially when reading over URL) I'd be fine with only the last file being read.
- Whether there is a header is defined at `resource` level, meaning all files should comply.
- I would not add a parameter in `read_resource()`, but always include this check. It is a recommended part of the specs: This property SHOULD correspond ...
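The trade-off for multipart resources discussed above can be sketched like this, again in base R. `read_header()`, `check_parts()` and the `all_parts` switch are hypothetical names used only for illustration; the default mirrors the suggestion to read only the last file.

```r
# Hypothetical helper: column names from the first line of a delimited file.
read_header <- function(path, delim = ",") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  scan(text = first_line, what = character(), sep = delim,
       quote = "\"", quiet = TRUE)
}

# Hypothetical check: compare only the last part by default, or every part.
check_parts <- function(paths, field_names, all_parts = FALSE) {
  to_check <- if (all_parts) paths else utils::tail(paths, 1)
  for (path in to_check) {
    col_names <- read_header(path)
    if (!identical(tolower(col_names), tolower(field_names))) {
      stop("Header of `", basename(path), "` does not match the schema.",
           call. = FALSE)
    }
  }
  invisible(TRUE)
}

# Two made-up parts: the first has its columns switched, the last is correct.
part_1 <- tempfile(fileext = ".csv")
part_2 <- tempfile(fileext = ".csv")
writeLines(c("lon,lat", "4.35,50.85"), part_1)
writeLines(c("lat,lon", "51.05,3.72"), part_2)

# Checking only the last part passes, even though the first part is wrong.
check_parts(c(part_1, part_2), c("lat", "lon"))
```

With `all_parts = TRUE` the same call would stop on the first part, at the cost of opening every file.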