clean_spelling: allow multiple variables in the 'variable' column

Open thibautjombart opened this issue 5 years ago • 4 comments

The dictionary-based cleaning could use something like:

from      to        variable
hopsital  hospital  location|structure_type
hopital   hospital  location|structure_type
hopsital  hospital  location|structure_type
feild     field     location
homw      home      location
maison    home      location
household home      location
<NA>      unknown   .all
.default  unknown   location|structure_type|sex|exposure

Here the variable column illustrates the following new features:

  1. | to list several variables
  2. .all as a wildcard meaning "all variables"

A way to implement the above is to treat entries in variable as regular expressions to be matched against column names, with an exception rule for .all.
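As a rough illustration of that idea (a language-neutral sketch in Python, not the package's R code), the lookup could work as follows. The helper name `resolve_columns` is hypothetical, and using a full match rather than a partial match is an assumption made here so that `location` does not also select a column such as `location_detail`.

```python
import re

def resolve_columns(entry, columns):
    """Return the column names that one 'variable' entry applies to."""
    if entry == ".all":
        # Exception rule: the wildcard selects every column.
        return list(columns)
    pattern = re.compile(entry)
    # fullmatch anchors the pattern at both ends of the column name,
    # so "a|b" selects exactly the columns named "a" or "b".
    return [col for col in columns if pattern.fullmatch(col)]

columns = ["location", "structure_type", "sex", "exposure", "location_detail"]

both = resolve_columns("location|structure_type", columns)
everything = resolve_columns(".all", columns)
single = resolve_columns("location", columns)
```

With this rule, `|` needs no special handling: it is simply regular-expression alternation.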

thibautjombart avatar Feb 20 '19 10:02 thibautjombart

The .all wildcard was named .global, and it has already been implemented :grin:

zkamvar avatar Feb 20 '19 10:02 zkamvar

Hi Thibaut and Zhian,

I've implemented a .regex keyword for clean_variable_spelling() in my linelist branch, to allow matching multiple variables as Thibaut describes.

We initially went with a regex = TRUE argument, to treat all vars as regular expressions, but found it was cumbersome and inelegant to anchor all the variables for which we just wanted literal matches. So we switched to the .regex keyword approach, which has been working well in some of our linelist work at Epicentre.
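To make the trade-off concrete, here is a small Python sketch (not the actual R implementation in the branch). It assumes, purely for illustration, that the keyword is written as a `.regex ` prefix followed by the pattern; the exact syntax is not given in this thread, and the helper name `match_variables` is hypothetical. Entries without the keyword are compared literally, so they need no anchoring.

```python
import re

REGEX_KEYWORD = ".regex "  # assumed prefix syntax, for illustration only

def match_variables(entry, columns):
    """Select columns for one dictionary entry.

    Entries starting with the keyword are treated as regular expressions;
    every other entry must equal the column name exactly.
    """
    if entry.startswith(REGEX_KEYWORD):
        pattern = re.compile(entry[len(REGEX_KEYWORD):].strip())
        return [col for col in columns if pattern.search(col)]
    return [col for col in columns if col == entry]

columns = ["location", "location_detail", "structure_type"]

literal = match_variables("location", columns)
by_regex = match_variables(".regex ^location", columns)
```

Under a global regex = TRUE switch, the literal case would instead have to be written as `^location$` to avoid also selecting `location_detail`.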

Let me know if you're interested, and I can create a pull request.

patrickbarks avatar Oct 15 '19 08:10 patrickbarks

Hi Patrick, that sounds great! A PR is most welcome, ideally with some new unit tests and an example in the function's documentation. Please also add yourself as a contributor in the DESCRIPTION file. It's really cool to see contributions to this package, and to hear that Epicentre is using it :)

thibautjombart avatar Oct 15 '19 09:10 thibautjombart

That makes sense! It also aligns with the .regex keyword in the clean_spelling() function, so go ahead with the PR and I'll have a looksee

zkamvar avatar Oct 15 '19 10:10 zkamvar