unconf17

Tidy record linkage package

Open 1danjordan opened this issue 7 years ago • 1 comment

Disparate or weakly linked data makes up the majority of the world's data, but we focus mainly on single-source datasets or on combining datasets with definite primary and foreign keys. A number of tidyverse-compliant packages exist for data cleansing and transformation, but not for deduplication or record linkage. The problem of record linkage is complex and well studied, but there are no tools or frameworks that fit nicely into a modern R workflow.

The RecordLinkage package is a brilliant package that does solve this problem, but its API is inconsistent and its data structures are awkward. A tidy record linkage package could build on the lessons learned from RecordLinkage while adhering to the "tidy way of life" and integrating nicely with other tidy tools. I think a package like this could open up a lot of possibilities for researchers and practitioners to work with and combine data they never could before.

1danjordan, Aug 09 '17 13:08

How do you feel about the fastLink package?

ck37, Sep 11 '17 15:09