readr
Registration of custom column types and parsers
This is a feature request that should not impact readr's current behaviour, but that would open up the possibility of enhancements from third-party packages (cc @edzer).
Motivation
Consider the cases in which a rectangular data source contains, e.g.,
- Quantities with units and/or errors: 1.53(2) m/s.
- Spatial data: POINT(0, 1).
readr::read_csv("
quantity,point
1.53(3) m/s,\"POINT(0,1)\"
5.21(1) m/s,\"POINT(1,5)\"
")
#> # A tibble: 3 x 2
#> quantity point
#> <chr> <chr>
#> 1 quantity point
#> 2 1.53(3) m/s POINT(0,1)
#> 3 5.21(1) m/s POINT(1,5)
Currently, those data types are read as character (and other use cases may be stripped to numbers), and the user needs to convert them. The idea would be to allow packages to register custom column types and parsers into readr so that, in this example, if packages quantities and sf were loaded, readr would have automatically generated columns with quantities and sf objects respectively.
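For reference, the manual conversion that users currently have to do might look roughly like the sketch below. It uses base R only and is not how the quantities package parses such strings. parse_quantity_string is a hypothetical helper, and it assumes that the number in parentheses is the uncertainty in the last digits of the value.

```r
# Hypothetical helper (not part of readr or quantities): split strings such
# as "1.53(3) m/s" into value, uncertainty and unit.
# Assumes the form <number>(<digits>) <unit>; anything else becomes NA.
parse_quantity_string <- function(x) {
  pattern <- "^[[:space:]]*([0-9]+\\.?[0-9]*)\\(([0-9]+)\\)[[:space:]]*(.*)$"
  m <- regmatches(x, regexec(pattern, x))
  ok <- lengths(m) == 4

  value <- rep(NA_real_, length(x))
  error <- rep(NA_real_, length(x))
  unit <- rep(NA_character_, length(x))

  for (i in which(ok)) {
    value[i] <- as.numeric(m[[i]][2])
    # Number of decimal places in the value, e.g. 2 for "1.53"
    decimals <- nchar(sub("^[0-9]+\\.?", "", m[[i]][2]))
    # "(3)" after two decimals is read as an uncertainty of 0.03
    error[i] <- as.numeric(m[[i]][3]) * 10^(-decimals)
    unit[i] <- m[[i]][4]
  }

  data.frame(value = value, error = error, unit = unit,
             stringsAsFactors = FALSE)
}

parsed <- parse_quantity_string(c("1.53(3) m/s", "5.21(1) m/s"))
```

With registered column types, this step would happen inside the reader instead of in user code.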
Changes required
I could be missing something, but the general changes needed for this would be:
- Move things to inst/include to expose the Collector class, so that other packages can link to readr and derive safely from this class.
- Some mechanism to register
  - (C++) the custom collector into the list of available subclasses (and the means to insert it in a specific position in the chain?) and a parser (guesser) function.
  - (R) custom col_* and parse_*.
The only drawback I can think of is that a package may, e.g., register a parser that catches everything and messes things up. To avoid this issue, readr::read_* may gain a flag to enable external parsers, so that using them requires an action from the user.
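To make the opt-in idea concrete, here is a minimal R-level sketch of a registry whose parsers are only consulted when the caller asks for them. Everything in it is hypothetical: register_parser, parse_column and the external_parsers flag are illustrative names, not readr functions or arguments, and the real mechanism would presumably live in C++.

```r
# Hypothetical registry; none of these functions exist in readr.
parser_registry <- new.env(parent = emptyenv())

# A package would call this (e.g. when loaded) to offer a guesser/parser pair.
register_parser <- function(name, guess, parse) {
  stopifnot(is.function(guess), is.function(parse))
  assign(name, list(guess = guess, parse = parse), envir = parser_registry)
  invisible(name)
}

# Convert one character column. Registered parsers are only tried when the
# caller opts in; otherwise the column is returned unchanged.
parse_column <- function(x, external_parsers = FALSE) {
  if (external_parsers) {
    for (name in sort(ls(parser_registry))) {
      entry <- get(name, envir = parser_registry)
      if (isTRUE(entry$guess(x))) {
        return(entry$parse(x))
      }
    }
  }
  x
}

# Example registration: a toy parser for "POINT(x,y)" strings.
register_parser(
  "point",
  guess = function(x) all(grepl("^POINT\\(.*\\)$", x)),
  parse = function(x) {
    coords <- strsplit(gsub("^POINT\\(|\\)$", "", x), ",")
    lapply(coords, function(p) as.numeric(trimws(p)))
  }
)

parse_column(c("POINT(0,1)", "POINT(1,5)"))                          # stays character
parse_column(c("POINT(0,1)", "POINT(1,5)"), external_parsers = TRUE) # list of numeric pairs
```

Because the flag defaults to FALSE, a greedy third-party guesser cannot change results for users who never asked for it.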
If this enhancement is considered, I would be more than happy to work on it.
I'll second this! It would be very welcome if we could extend the functionality of readr with other packages.
Our AMR package is all about antimicrobial resistance (AMR). The new class rsi makes sure columns only contain valid antimicrobial interpretations: resistant (R), susceptible (S) or intermediate (I). This can be forced upon a vector with as.rsi.
Anyway, if data are read from a microbiological laboratory system (in a hospital), all columns with results of antibiotics will contain just R, S or I values. Such columns could be parsed as rsi when the AMR package is loaded 😄
Consider this thirded! Some notes: IMO, users should be able to supply parsers as a named list, and maybe to override built-in parsers (throwing a warning if they try to do this, with an option to suppress it, maybe).
With col_types, the abbreviation string could be extended so that users can instead refer to their own parser by its key.
This would still be awesome to have 🙄😄
I also have a use for this. In my case, values are written to a character field and are probably best represented as a matrix in R. The CSV records look like this:
28255348,true,2008-06-20 19:00:00.000,,,"XY",5.0,"1883 2016 1897 2006 1890 1998 1928 2016 1916 2037 1900 2056 1857 2064 1897 2049 1889 2048 1894 2042 1883 2032 1909 2027 1892 2028 1890 2032 1888 2038 1894 2037 1898 2040 1897 2032 1896 2040 1896 2036 1908 2042 1895 2036 1897 2044 1904 2048 1900 2037 1900 2048 1902 2048 1907 2048 1907 2040 1897 2032 1901 2032 1906 2030 1901 2036 1897 2029 1898 2036 1907 2026 1896 2016 1901 2024 1902 2020 1904 2021 1902 2026 1895 2016 1898 2023 1907 2012 1900 2023 1893 2012 1904 2012 1892 2012 1891 1996 1952 2000 1883 2060 1892 2060 1896 2050 1901 2030 1902 2034 1906 2036 1912 2026 1924 2032 1912 2028 1910 2030",,,,69253874,,2008-06-20 19:00:00.000
These are measurements of acceleration made in a 5 Hz burst over 2 axes. Alternatively, if I code them as raw vectors, I can considerably reduce the memory consumption (a 66% reduction).
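As an illustration of the conversion such a custom parser would perform, the base-R sketch below turns one of these space-separated fields into a matrix. parse_burst is a hypothetical helper, and it assumes that the values alternate between the two axes, which the record itself does not confirm.

```r
# Hypothetical helper: turn "1883 2016 1897 2006 ..." into an integer matrix
# with one column per axis. Assumes the values are interleaved by axis
# (first axis, second axis, first axis, ...).
parse_burst <- function(x, n_axes = 2L) {
  values <- as.integer(strsplit(trimws(x), "[[:space:]]+")[[1]])
  stopifnot(length(values) %% n_axes == 0)
  matrix(values, ncol = n_axes, byrow = TRUE,
         dimnames = list(NULL, paste0("axis", seq_len(n_axes))))
}

# First three pairs from the record above
burst <- parse_burst("1883 2016 1897 2006 1890 1998")
```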
I also have a use case for this in {csvwr}, where I need to handle lots of different parsing options per the CSVW specification (which extends the XML Schema Datatypes standard). For example, I'd like to be able to configure col_logical to parse e.g. "yes" & "no". I'm also wondering if I could use collectors to do some validation (e.g. that a string matches a regex or a number is within some bounds).
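A configurable logical parser and a simple validator could be sketched at the R level as follows. Both make_logical_parser and validate_pattern are hypothetical names used for illustration; they are not part of readr or {csvwr}.

```r
# Hypothetical parser factory: returns a function that maps the configured
# strings to TRUE/FALSE and everything else to NA.
make_logical_parser <- function(true = "yes", false = "no") {
  function(x) {
    out <- rep(NA, length(x))
    out[x %in% true] <- TRUE
    out[x %in% false] <- FALSE
    out
  }
}

# Hypothetical validator: flags non-missing values that do not match a regex.
validate_pattern <- function(x, pattern) {
  !is.na(x) & !grepl(pattern, x)
}

parse_yes_no <- make_logical_parser(true = "yes", false = "no")
parse_yes_no(c("yes", "no", "maybe"))

invalid <- validate_pattern(c("A1", "b2", NA), "^[A-Z][0-9]$")
```

Returning a closure keeps the configuration with the parser, which is the same pattern a configurable collector would need.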
Contrary to the posts above, I'd suggest that parsers be specified explicitly with a cols() specification rather than having them registered as a side-effect of loading a package. As convenient as the latter might be for authors of individual packages, it won't be obvious to users and offers no way to reconcile conflicts between packages.
Likewise, I wouldn't extend this to the compact string representation, as that will quickly lead to collisions. We've already used a quarter of the (lower/upper case alphabet) namespace between readr and this thread!
Indeed, I'd expect that some parsers will benefit from configuration (as do col_date etc.).
Instead, I'd suggest this work with the col_types argument, e.g.
readr::read_csv("data.csv", col_types = list(mypackage::col_logical("yes" = TRUE, "no" = FALSE)))
One thing that strikes me looking at the source is that the collectors are implemented in C++, presumably for speed.
Would it be possible to modify Collector::create such that it can resolve third-party collectors at runtime? All the possible collectors won't be available when this class is compiled, so there would need to be some way to register/find collectors.
This would still require that packages implement their collectors in C++. An alternative might be to allow people to provide collectors in R code, accepting that this might dramatically slow down parsing. This would lower the barriers to extension without prohibiting package authors from re-implementing their parsers in C++ if it proved too slow in R.
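The R-only alternative could be approximated today as a post-processing step: read every column as character, then apply user-supplied R functions keyed by column name. The sketch below is only an illustration of that idea. apply_r_collectors is a hypothetical helper, and base R's read.csv is used to keep the example self-contained; with readr, the equivalent first step would be reading with col_types = cols(.default = col_character()).

```r
# Hypothetical helper: apply R-level "collectors" (plain functions), keyed by
# column name, to a data frame whose columns were all read as character.
apply_r_collectors <- function(df, collectors) {
  unknown <- setdiff(names(collectors), names(df))
  if (length(unknown) > 0) {
    stop("No such column(s): ", paste(unknown, collapse = ", "))
  }
  for (col in names(collectors)) {
    df[[col]] <- collectors[[col]](df[[col]])
  }
  df
}

# Read everything as character first
raw <- read.csv(text = "id,active\n1,yes\n2,no", colClasses = "character")

# Then convert explicitly, column by column
parsed <- apply_r_collectors(raw, list(
  id = as.integer,
  active = function(x) x == "yes"
))
```

This costs a second pass over the data, which is the slowdown mentioned above, but it needs no C++ and makes the choice of parser explicit per column.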