frictionless-r icon indicating copy to clipboard operation
frictionless-r copied to clipboard

Make `read_resource()` more modular

Open peterdesmet opened this issue 1 year ago • 1 comments

read_resource() is very long. It could benefit from a more modular approach.

✅ Get resource, includes check_package()

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L202-L204

  • [x] Already a helper function

✅ Get paths, schema and fields

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L205-L210

  • [x] Not much gain here

✅ Check all selected columns appear in schema

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L211-L222

  • [x] Typical check, not much gain here

❌ Create locale with encoding, decimal_mark and grouping_mark

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L223-L262

  • [x] Needs helper function get_encoding() (or encoding())
  • [x] Needs resource$encoding and resource$fields, so best to pass resource (and package)
  • [x] Returns readr::locale

❌ Create col_types: list(<collector_character>, <collector_logical>, ...)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L263-L327

What we want is:

cols(
  field1 = col_integer(),
  field2 = col_character(),
  ...
)
  • [x] Helper function get_cols() (or cols()). It could work on a single field, but more useful is that it always returns a readr::cols()
  • [x] It then also aligns with a future get_na_cols() (or na_cols()) cf. https://kylehusmann.com/interlacer/reference/na_cols.html
  • [x] Maybe internally it also has a function get_col_type() (or col()) for a single field (so get_cols() is mostly a wrapper).
  • [x] Do we need a helper function per column type? Would likely improve readability if just have a switch on field type that then correctly passes the field to the correct function?

So:

get_cols <- function(fields) {
  for field:
    get_col(field)
}

get_col <- function(field) {
  type <-
  switch(type,
    "string" = col_string(),
    "number" = col_number(),
    ...
  )
}

col_string <- function() ...
col_number <- function(field) {
  get_enum()
  get_bare_number()
  get_format()
  ...
} 

❌ Assign names: list("name1" = <collector_character>, "name2" = ...)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L328-L330

  • [x] This should be integrated in get_cols()

✅ Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L331-L334

  • [x] Not much gain here

✅ Read data directly

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L335-L338

  • [x] Not much gain here

✅ Read data from data

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L339-L342

  • [x] Not much gain here

❌ Read data from path(s)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L343-L379

Is a for loop over paths, calling readr::read_delim() for all and then dplyr::bind_rows()

  • [x] Maybe we can have a read_delim() that calls all the get_cols() internally rather than building it encoding and cols and na_cols beforehand? But maybe that is harder to debug?

peterdesmet avatar Jun 10 '24 14:06 peterdesmet