Link() generates more pairs than the original samples

Open zlkrvsm opened this issue 4 years ago • 0 comments

This issue seems similar to #2, but I'm not encountering problems during pair generation.

Hello @djvanderlaan, thanks for the package, it's really easy to use, but I'm having an issue when trying to link two subdatasets.

In my case, I'm using a dataset of disease reports where some reports are for returning patients. I want to link the return patients to their original visit, but there is no unique id, hence probabilistic linkage. Below is a minimal example of the data:

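| id | sex | age | return_visit | date   |
|----|-----|-----|--------------|--------|
| 1  | M   | 25  | TRUE         | Aug 01 |
| 2  | F   | 19  | TRUE         | Sep 29 |
| 3  | M   | 25  | FALSE        | Sep 15 |
| 4  | F   | 19  | FALSE        | Jul 19 |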

I have extra variables to "identify" my patients, but the basic idea is that 3 and 4 are the same people as 1 and 2. So I created two subdatasets based on the return_visit variable and used just simple blocking and some filtering to reduce the number of pairs.

So far so good.

I then created a dummy variable with value TRUE so it would capture all pairs, and used a selection of date variables to keep only the possible cases, leaving me with just 83,000 pairs. The problem arises when I use link(): it returns 3.3 million records.
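Since I can't share the real data, here is a rough sketch of the same kind of pipeline on toy data. It is not my actual code: the two toy data frames, the blocking on `sex`, the comparison on `age`, and the use of `score_simsum()` as the weight are all stand-ins for illustration, and the function calls follow the pattern shown in the reclin vignette.

```r
library(reclin)

# Toy stand-ins for the two subdatasets (split on return_visit)
return_visits <- data.frame(id = c(1, 2), sex = c("M", "F"), age = c(25, 19),
                            stringsAsFactors = FALSE)
first_visits  <- data.frame(id = c(3, 4), sex = c("M", "F"), age = c(25, 19),
                            stringsAsFactors = FALSE)

# Generate candidate pairs, blocking on sex (stand-in blocking variable)
p <- pair_blocking(return_visits, first_visits, blocking_var = "sex")

# Compare the pairs on age and sum the similarities into a simple score
p <- compare_pairs(p, by = "age")
p <- score_simsum(p, var = "simsum")

# Greedy one-to-one selection; the result is stored in the variable "greedy"
p <- select_greedy(p, "simsum", var = "greedy", threshold = 0)

# Build the linked dataset from the selected pairs
linked <- link(p, selection = "greedy")
nrow(linked)
```

In my real run, it is the `nrow(linked)` at the end that comes out far larger than the number of selected pairs.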

Surely my pairs are in there, but the result looks like a full_join of both datasets, and I cannot for the life of me understand why link() doesn't respect my selection variable or why it includes every record. Is it an issue with subdatasets? Have I made a mistake somewhere between using select_greedy() and link()? Is it an issue with using datasets that have the same number of variables, all with the same names?

Unfortunately, for confidentiality reasons, I cannot provide a reprex, but if you can point me in the right direction I'll do my own research. Thanks.

| Stage           | Number of records     |
|-----------------|-----------------------|
| return visits   | 550 thousand patients |
| first visits    | 2.8 million patients  |
| after blocking  | 1.07 million pairs    |
| after filtering | 83 thousand pairs     |
| after linking   | 3.3 million records   |
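As a rough sanity check on the full_join suspicion, the counts above can be plugged into a quick calculation. This assumes two things that I have not confirmed: that the output keeps every record from both datasets, and that each selected pair joins exactly one return visit to one first visit. The counts are the rounded figures from the table, so the result is only approximate.

```r
# Rounded counts from the table above
n_return    <- 550e3   # return visits
n_first     <- 2.8e6   # first visits
n_pairs_max <- 83e3    # pairs left after filtering (upper limit on links)

# Rows in a full outer join = all records from both sides, minus one row
# for every one-to-one link (a linked pair collapses two records into one)
upper <- n_return + n_first                # if no pair were linked
lower <- n_return + n_first - n_pairs_max  # if every filtered pair were linked

c(lower = lower, upper = upper)
```

Both bounds land close to the 3.3 million records I get after linking, which is why the output looks to me like a full join rather than only the selected pairs.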

zlkrvsm avatar Oct 29 '20 15:10 zlkrvsm