reclin
link() generates more records than the original samples
This issue seems similar to #2, but I'm not encountering problems during pair generation.
Hello @djvanderlaan, thanks for the package; it's really easy to use. However, I'm having an issue when trying to link two subdatasets.
In my case, I'm using a dataset of disease reports where some reports are for returning patients. I want to link the returning patients to their original visit, but there is no unique id, hence probabilistic linking. Below is a minimal example of the process:
| id | sex | age | return_visit | date   |
|----|-----|-----|--------------|--------|
| 1  | M   | 25  | TRUE         | Aug 01 |
| 2  | F   | 19  | TRUE         | Sep 29 |
| 3  | M   | 25  | FALSE        | Sep 15 |
| 4  | F   | 19  | FALSE        | Jul 19 |
I have extra variables to "identify" my patients, but the basic idea is that 3 and 4 are the same people as 1 and 2. So I created two subdatasets based on the return_visit variable and used just simple blocking and some filtering to reduce the number of pairs.
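To make that concrete, here is a rough sketch of this first stage using the toy rows above (the blocking variables here are stand-ins for my real ones, and the year in the dates is made up since the table only gives month and day):

```r
library(reclin)

# Toy data mirroring the table above; the year is arbitrary
visits <- data.frame(
  id           = 1:4,
  sex          = c("M", "F", "M", "F"),
  age          = c(25, 19, 25, 19),
  return_visit = c(TRUE, TRUE, FALSE, FALSE),
  date         = as.Date(c("2021-08-01", "2021-09-29",
                           "2021-09-15", "2021-07-19"))
)

# Two subdatasets based on the return_visit variable
returns <- visits[visits$return_visit, ]
firsts  <- visits[!visits$return_visit, ]

# Simple blocking (on sex and age here; my real blocking variables differ)
pairs <- pair_blocking(returns, firsts, blocking_var = c("sex", "age"),
                       large = FALSE)
```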
So far so good.
I then created a dummy variable with value TRUE so it would capture all pairs, and used the date variables to select possible cases, leaving me with just 83,000 pairs. The problem arises when I use link(): it returns 3.3 million records.
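Continuing the sketch above, that selection and linking stage looks roughly like this (the date rule, score column, and threshold are placeholders, not my exact code):

```r
# Dummy variable that is TRUE for every pair
pairs$dummy <- TRUE

# Date-based selection, here "visits within 90 days of each other" as a
# placeholder for my real date rules; pairs$x and pairs$y are row indices
# into returns and firsts respectively
pairs$date_ok <- abs(as.numeric(returns$date[pairs$x] -
                                firsts$date[pairs$y])) <= 90

# Combine the criteria into a score and make a one-to-one greedy selection
pairs$score <- as.numeric(pairs$dummy & pairs$date_ok)
pairs <- select_greedy(pairs, "score", var = "select", threshold = 0.5)

# The linking step: on my real data this returns 3.3 million records
# instead of the ~83,000 selected pairs
result <- link(pairs)
```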
Surely my pairs are in there, but it seems to be a full_join of both datasets (3.3 million is roughly 550 thousand plus 2.8 million), and I cannot for the life of me understand why link() doesn't respect my selection variable or why it includes every record. Is it an issue with the subdatasets? Have I made a mistake somewhere between select_greedy() and link()? Is it an issue with using datasets that have the same number of variables, all of which have the same names?
Unfortunately, for confidentiality reasons, I cannot provide a reprex with the real data, but if you can point me in the right direction I'll do my own research. Thanks.
For reference, the counts at each stage:

| Stage           | Number of records     |
|-----------------|-----------------------|
| return visits   | 550 thousand patients |
| first visits    | 2.8 million patients  |
| after blocking  | 1.07 million pairs    |
| after filtering | 83 thousand pairs     |
| after linking   | 3.3 million records   |