`col_select` in `read_package()` does not support tidyselection
I expected col_select in read_package to allow tidyselection like it's namesake arg in read_delim et al., but it looks like it only supports character vectors:
system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=comments)
# Error in col_select %in% field_names : object 'comments' not found
system.file("extdata", "datapackage.json", package="frictionless") %>%
read_package() %>%
read_resource("deployments", col_select=starts_with("c"))
# Error in starts_with("c") : could not find function "starts_with"
Hmmm in read_resource is field_names basically the same as col_names?
field_names looks like it's only used with the col_select assertion at the top there, and is the only thing the tidyselect is breaking on. If you remove it entirely (lines 210-228), tidyselect works just fine (because it's directly passed to read_delim).
Was the all(col_select %in% field_names) assertion added to make the error message a little more readable? When I have the chance I'll look into the tidyselection stuff to see if there's a tidy way to make that assertion instead...
I looked into this a bit more, and it looks like the cryptic error messages thrown by col_select on invalid selections are a vroom issue. Once that is resolved, we should be able to directly forward col_select to read_delim, and always get nice error messages. So I don't think we need to make the assertion on our end; we can just remove lines 210-228 to resolve this issue.
Hi @khusmann, thanks for looking into this.
Note that the variable field_names contains the names of the fields in the schema (not the headers in the CSV).
-
Lines 212-228 checks up front if
col_selectonly contains names that were defined in the schema. If that can be simplified and adapted to support tidyselection, great! - Indeed
col_namesis basically the same asfield_names, but checked additionally to make sure there are no empty names (i.e. fields in the schema without aname). The code could be adapted to do that checking earlier (e.g. right after line 210 or even as part of get_schema() ). That would allow to use the variablefield_namesthroughout (and not additionally definecol_names).
Thanks for the feedback!
- I think the simplest way of doing this is to simply add
.default=col_skip()to thecol_typeslist inread_delim. Because col_types is created via the schema, this would prevent loading any columns in the csv that weren't also valid fields. Unfortunately, with the vroom issue I mention above, you get cryptic errors in the special case when the user selects no valid fields. If we want to throw our own errors instead, instead of waiting on vroom to have this fixed, lmk, and I can do that instead with our own tidyselection call.
Side note: right now it doesn't look like col_select applies on resources with read_from == "df" and read_from == "data"... do we want it to apply to all calls of read_resource, or just when reading CSVs? (I think all calls probably makes the most sense)
- Hmmm, it looks like the "no empty names check" happens already in
check_schema()(L20) which is called byget_schema(). So if I'm reading this correctly I think we can treat the name field as a guaranteed attribute of a well-formed schema, no need to check again... what do you think?
There have been some updates in the code, so I'll summarize what needs to be done:
- [x] Don't define
col_names, just usefield_namesthroughout (item 2 in this and this comment: https://github.com/frictionlessdata/frictionless-r/commit/44b2c5562ea2df5f0a7e8b07a24f51ec26263076 - [ ] Convert a tidyselection in
col_selectsofrictionless_error_colselect_mismatch(L211-221) can deal with that. Alternatively and not preferred: drop thefrictionless_error_colselect_mismatchcheck and handle vroom errors likeError in UseMethod("collector_value"). - [ ] Write tests for this behaviour.
- [x] Extend
col_selectfrom CSV resources to all resources (including df and data). Described in #197