linelist
clean_spelling: allow multiple variables in the 'variable' column
The dictionary-based cleaning could use something like:
from        to         variable
hopsital    hospital   location|structure_type
hopital     hospital   location|structure_type
hopsital    hospital   location|structure_type
feild       field      location
homw        home       location
maison      home       location
household   home       location
<NA>        unknown    .all
.default    unknown    location|structure_type|sex|exposure
Where the field 'variable' illustrates the following new features:
- | to list several variables
- .all as a wildcard meaning "all variables"
A way to implement the above is to treat entries in 'variable' as regular expressions to be matched against column names, with an exception rule for .all.
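The matching rule proposed above can be sketched in a few lines. This is only a language-agnostic illustration written in Python, not the package's R code: the helper name resolve_variables and the sample column names are hypothetical, and the use of a full match (so that "location" does not also pick up "location_notes") is an assumption the proposal does not spell out.

```python
import re

def resolve_variables(entry, columns, wildcard=".all"):
    """Return the column names targeted by one 'variable' entry.

    Hypothetical helper: the wildcard is handled as a special case,
    every other entry is treated as a regular expression.
    """
    if entry == wildcard:
        # Exception rule: the wildcard selects every column.
        return list(columns)
    pattern = re.compile(entry)
    # Assumption: require a full match against the column name.
    return [col for col in columns if pattern.fullmatch(col)]

# Hypothetical column names, for illustration only.
columns = ["location", "structure_type", "sex", "exposure", "location_notes"]

print(resolve_variables("location|structure_type", columns))
print(resolve_variables("location", columns))
print(resolve_variables(".all", columns))
```

With this rule, the | separator needs no special handling, since it is already the alternation operator in regular expressions.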
The .all wildcard was named .global and it has already been implemented :grin:
Hi Thibaut and Zhian,
I've implemented a .regex keyword for clean_variable_spelling() in my linelist branch, to allow matching multiple variables as Thibaut describes.
We initially went with a regex = TRUE argument, to treat all vars as regular expressions, but found it was cumbersome and inelegant to anchor all the variables for which we just wanted literal matches. So we switched to the .regex keyword approach, which has been working well in some of our linelist work at Epicentre.
Let me know if you're interested, and I can create a pull request.
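To illustrate the trade-off described in this comment, here is a rough sketch in Python rather than the package's R code. The entry syntax ".regex <pattern>", the helper name match_columns, and the column names are all assumptions made for the example; the comment above only says that a .regex keyword exists.

```python
import re

# Assumed syntax: an entry starting with ".regex " carries a pattern;
# any other entry is a literal column name.
REGEX_KEYWORD = ".regex "

def match_columns(entry, columns):
    """Hypothetical helper: select columns for one 'variable' entry."""
    if entry.startswith(REGEX_KEYWORD):
        pattern = re.compile(entry[len(REGEX_KEYWORD):])
        return [col for col in columns if pattern.search(col)]
    # Literal entries need no anchoring: they match by equality.
    return [col for col in columns if col == entry]

# Hypothetical column names, for illustration only.
columns = ["lab_result_01", "lab_result_02", "result", "date_result"]

literal = match_columns("result", columns)
by_regex = match_columns(".regex ^lab_result_", columns)

# What a global "treat everything as a regex" switch would do with the
# same unanchored entry: it matches every column containing "result".
unanchored = [col for col in columns if re.search("result", col)]

print(literal)
print(by_regex)
print(unanchored)
```

Under a global switch, the literal entry would have to be written as ^result$ to avoid the extra matches, which is the anchoring burden the keyword approach avoids.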
Hi Patrick,
That sounds great! A PR is most welcome, ideally with some new unit tests and an example in the doc of the function. Please also add yourself as a contributor in the DESCRIPTION file. But really cool to see contribs on this package, and to hear Epicentre is using it :)
That makes sense! It also aligns with the .regex keyword in the clean_spelling() function, so go ahead with the PR and I'll have a looksee.