reclin
Add options to support pairs within a single data set
I'm using this package not for linking two different data sets, but for analysing pairs within a single data set. In this context, reclin creates more pairs than necessary.
For example, if input data sets x and y both refer to the same data set df, then I don't want to compare x[1] with y[1] because they are the same record. Similarly, I often don't want to compare both x[1]/y[2] and x[2]/y[1] because the comparison is symmetric, e.g. f(x, y) = f(y, x).
Hi @jkeirstead,
There is a vignette on this: https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html. You can use the function filter_pairs_for_deduplication to remove the duplicate pairs. This still creates all the pairs initially, which is not ideal for large data sets. I am working on reimplementing some parts of reclin and this will probably be one of the things I tackle. I don't know yet when this will be finished.
I had seen that vignette and noticed that it filters after pair creation. In my case, the bottleneck is actually in the comparison scoring, so that method would work, but I've just been using a tidyverse solution: pairs %>% filter(x < y)
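For anyone reading along, here is a minimal base R sketch of what that filter does. It is not reclin's own code: the pairs table below is a hypothetical stand-in built with expand.grid, and it assumes x and y are row indices into the same data set.

```r
# Hypothetical stand-in for a pairs table: x and y are row indices into the
# same data set (n = 5 records here). This is not output from reclin itself.
n <- 5
pairs <- expand.grid(x = seq_len(n), y = seq_len(n))

# Keep each unordered pair once and drop self-pairs (x == y).
dedup_pairs <- pairs[pairs$x < pairs$y, ]

nrow(pairs)        # n^2 candidate pairs
nrow(dedup_pairs)  # (n^2 - n) / 2 pairs remain

# The same index pairs can also be generated directly, without first
# building all n^2 rows.
direct <- t(combn(n, 2))
```

The x < y condition removes both the self-pairs and one of each mirrored pair in a single step, which is why it is enough when the comparison is symmetric.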
The package is really helpful though - thanks!
Congratulations on the package, it is an excellent initiative. I also used your vignette to clean up the duplicate records in my Covid-19 database for Belo Horizonte-MG (Brazil), but it exceeded the limits. The errors were as follows:
Error in `length<-.lvec`(`*tmp*`, value = lx + length(y)) :
  std::bad_alloc
In addition: Warning message:
In lx + length(y) : NAs produced by integer overflow
Do you have any suggestions for avoiding the error, other than splitting the data?
@jeanbarrado The package (and lvec) doesn't handle more than 2^31 pairs, which judging from the error message seems to be the case here. Please first check the expected number of pairs: for deduplication without blocking you have (n^2 - n)*0.5 pairs. If this number is less than 2^31, deduplication should in principle be possible. Unfortunately, my package currently generates n^2 pairs first. If your final number of pairs is less than 2^31, I think I can cook up a workaround.
But with around 2^31 pairs, computation time is going to be quite substantial.
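A quick way to run that check is sketched below. The record count n is a hypothetical example value, and the two helper functions are made up for illustration; only the formulas and the 2^31 limit come from the reply above.

```r
# Limit on the number of pairs, as stated in the reply above.
max_pairs <- 2^31

# Work in doubles: integer arithmetic such as n * n returns NA with a warning
# once the result exceeds R's 32-bit integer range.
n_pairs_dedup <- function(n) (as.numeric(n)^2 - as.numeric(n)) * 0.5
n_pairs_full  <- function(n) as.numeric(n)^2

n <- 50000  # hypothetical number of records

n_pairs_dedup(n) < max_pairs  # do the final deduplicated pairs fit?
n_pairs_full(n)  < max_pairs  # do the intermediate n^2 pairs fit?
```

With this example value the deduplicated pairs fit under the limit, but the intermediate n^2 pairs do not, which is the situation where a workaround or blocking would be needed.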
A common way to reduce the number of pairs is to apply some sort of blocking: only generate pairs when the records agree on some variable, e.g. city, province, or the first letter of the name.
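To illustrate the idea only, here is a small base R sketch of blocking within a single data set. It does not use reclin; the records data frame and its id and city columns are hypothetical, and the self-join via merge is just one simple way to show the effect on the pair count.

```r
# Hypothetical records with a blocking variable (city).
records <- data.frame(
  id   = 1:6,
  city = c("A", "A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Self-join on the blocking variable: only records from the same city pair up.
blocked <- merge(records, records, by = "city", suffixes = c(".x", ".y"))

# Drop self-pairs and mirrored pairs, as for deduplication.
blocked <- blocked[blocked$id.x < blocked$id.y, ]

n_blocked <- nrow(blocked)             # pairs left after blocking
n_all     <- choose(nrow(records), 2)  # pairs without blocking
```

The trade-off is that records that disagree on the blocking variable (for example because of a typo in the city) are never compared, so the variable should be one that is reasonably reliable.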
My Portuguese is not very good. In this case I was able to follow the question with the help of Google Translate, but please use English if you can.
Hi! Continuing to have this issue