frictionless-r
frictionless-r copied to clipboard
Make `read_resource()` more modular
read_resource() is very long. It could benefit from a more modular approach.
✅ Get resource, includes check_package()
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L202-L204
- [x] Already a helper function
✅ Get paths, schema and fields
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L205-L210
- [x] Not much gain here
✅ Check all selected columns appear in schema
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L211-L222
- [x] Typical check, not much gain here
❌ Create locale with encoding, decimal_mark and grouping_mark
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L223-L262
- [x] Needs helper function
get_encoding()(orencoding()) - [x] Needs
resource$encodingandresource$fields, so best to passresource(andpackage) - [x] Returns
readr::locale
❌ Create col_types: list(<collector_character>, <collector_logical>, ...)
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L263-L327
What we want is:
cols(
field1 = col_integer(),
field2 = col_character(),
...
)
- [x] Helper function
get_cols()(orcols()). It could work on a single field, but more useful is that it always returns areadr::cols() - [x] It then also aligns with a future
get_na_cols()(orna_cols()) cf. https://kylehusmann.com/interlacer/reference/na_cols.html - [x] Maybe internally it also has a function
get_col_type()(orcol()) for a single field (soget_cols()is mostly a wrapper). - [x] Do we need a helper function per column type? Would likely improve readability if just have a switch on field type that then correctly passes the field to the correct function?
So:
get_cols <- function(fields) {
for field:
get_col(field)
}
get_col <- function(field) {
type <-
switch(type,
"string" = col_string(),
"number" = col_number(),
...
)
}
col_string <- function() ...
col_number <- function(field) {
get_enum()
get_bare_number()
get_format()
...
}
❌ Assign names: list("name1" = <collector_character>, "name2" = ...)
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L328-L330
- [x] This should be integrated in
get_cols()
✅ Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L331-L334
- [x] Not much gain here
✅ Read data directly
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L335-L338
- [x] Not much gain here
✅ Read data from data
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L339-L342
- [x] Not much gain here
❌ Read data from path(s)
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L343-L379
Is a for loop over paths, calling readr::read_delim() for all and then dplyr::bind_rows()
- [x] Maybe we can have a
read_delim()that calls all theget_cols()internally rather than building itencodingandcolsandna_colsbeforehand? But maybe that is harder to debug?