frictionless-r `col_select` in `read_package()` does not support tidyselection

I expected col_select in read_package to allow tidyselection like it's namesake arg in read_delim et al., but it looks like it only supports character vectors:

system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=comments)

# Error in col_select %in% field_names : object 'comments' not found

system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=starts_with("c"))

# Error in starts_with("c") : could not find function "starts_with"

Nov 11 '23 18:11 khusmann

Hmmm in read_resource is field_names basically the same as col_names?

field_names looks like it's only used with the col_select assertion at the top there, and is the only thing the tidyselect is breaking on. If you remove it entirely (lines 210-228), tidyselect works just fine (because it's directly passed to read_delim).

Was the all(col_select %in% field_names) assertion added to make the error message a little more readable? When I have the chance I'll look into the tidyselection stuff to see if there's a tidy way to make that assertion instead...

Nov 11 '23 19:11 khusmann

I looked into this a bit more, and it looks like the cryptic error messages thrown by col_select on invalid selections are a vroom issue. Once that is resolved, we should be able to directly forward col_select to read_delim, and always get nice error messages. So I don't think we need to make the assertion on our end; we can just remove lines 210-228 to resolve this issue.

Nov 12 '23 20:11 khusmann

Hi @khusmann, thanks for looking into this.

Note that the variable field_names contains the names of the fields in the schema (not the headers in the CSV).

Lines 212-228 checks up front if col_select only contains names that were defined in the schema. If that can be simplified and adapted to support tidyselection, great!
Indeed col_names is basically the same as field_names, but checked additionally to make sure there are no empty names (i.e. fields in the schema without a name). The code could be adapted to do that checking earlier (e.g. right after line 210 or even as part of get_schema() ). That would allow to use the variable field_names throughout (and not additionally define col_names).

Nov 13 '23 08:11 peterdesmet

Thanks for the feedback!

I think the simplest way of doing this is to simply add .default=col_skip() to the col_types list in read_delim. Because col_types is created via the schema, this would prevent loading any columns in the csv that weren't also valid fields. Unfortunately, with the vroom issue I mention above, you get cryptic errors in the special case when the user selects no valid fields. If we want to throw our own errors instead, instead of waiting on vroom to have this fixed, lmk, and I can do that instead with our own tidyselection call.

Side note: right now it doesn't look like col_select applies on resources with read_from == "df" and read_from == "data"... do we want it to apply to all calls of read_resource, or just when reading CSVs? (I think all calls probably makes the most sense)

Hmmm, it looks like the "no empty names check" happens already in check_schema() (L20) which is called by get_schema(). So if I'm reading this correctly I think we can treat the name field as a guaranteed attribute of a well-formed schema, no need to check again... what do you think?

Nov 13 '23 19:11 khusmann

There have been some updates in the code, so I'll summarize what needs to be done:

[x] Don't define col_names, just use field_names throughout (item 2 in this and this comment: https://github.com/frictionlessdata/frictionless-r/commit/44b2c5562ea2df5f0a7e8b07a24f51ec26263076
[ ] Convert a tidyselection in col_select so frictionless_error_colselect_mismatch (L211-221) can deal with that. Alternatively and not preferred: drop the frictionless_error_colselect_mismatch check and handle vroom errors like Error in UseMethod("collector_value").
[ ] Write tests for this behaviour.
[x] Extend col_select from CSV resources to all resources (including df and data). Described in #197

Mar 27 '24 07:03 peterdesmet