fuzzyjoin

Not all matches returned using regex_left_join

Open • aminards opened this issue 3 years ago • 1 comment

I have two data frames that I need to merge based on a partial string match.
Data frame A has a Gene.Name column containing EHBP1. Data frame B has a Gene.Symbols column containing these values:

CLEC7A,EHBP1
CLEC7A,EHBP1
MBL2,CLEC7A,EHBP1
MBL2,CLEC7A,EHBP1,HTR2A
MBL2,CLEC7A,EHBP1,HTR2A
EHBP1,HTR2A
EHBP1,HTR2A
MBL2,CLEC7A,EHBP1,HTR2A
EHBP1
EHBP1
EHBP1
EHBP1
EHBP1
EHBP1
TBX15,MBL2,SNORD54,CLEC7A,RREB1,MRPL51,GGTLC2,MIR30A,SETMAR,GFOD1,STK33,KHDRBS2,EHBP1,RCL1,HTR2A

When I run the following command: mydata <- regex_left_join(A, B, by = c(Gene.Name = "Gene.Symbols"))

Only some of the matches are returned. I get only these matches:

EHBP1
EHBP1
EHBP1
EHBP1
EHBP1
EHBP1

Why am I not getting these remaining matches?

MBL2,CLEC7A,EHBP1,HTR2A
CLEC7A,EHBP1
CLEC7A,EHBP1
MBL2,CLEC7A,EHBP1
MBL2,CLEC7A,EHBP1,HTR2A
EHBP1,HTR2A
EHBP1,HTR2A
MBL2,CLEC7A,EHBP1,HTR2A
TBX15,MBL2,SNORD54,CLEC7A,RREB1,MRPL51,GGTLC2,MIR30A,SETMAR,GFOD1,STK33,KHDRBS2,EHBP1,RCL1,HTR2A

aminards, Aug 25 '21 16:08

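Because the regular expression has to be in the right-hand table. regex_left_join treats the by column of the second data frame (here Gene.Symbols) as the pattern and the by column of the first (Gene.Name) as the text to search. A pattern such as CLEC7A,EHBP1 is not found inside the text EHBP1, so only the rows where Gene.Symbols is exactly EHBP1 match. Swap the tables so that Gene.Name becomes the pattern; you might need: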

mydata <- regex_right_join(B, A, by = c(Gene.Symbols = "Gene.Name"))
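To see why the direction matters, here is a minimal base R sketch that mimics the two matching directions with grepl. It is only an illustration: it does not call fuzzyjoin, the sample values are abbreviated from the issue, and the variable names are made up for this example.

```r
# Illustrative values abbreviated from the issue (not the full data).
gene_name    <- "EHBP1"
gene_symbols <- c("CLEC7A,EHBP1", "EHBP1,HTR2A", "EHBP1", "MBL2,CLEC7A")

# Direction of the original call: each Gene.Symbols value is used as the
# pattern and Gene.Name is the text being searched.
symbols_as_pattern <- vapply(
  gene_symbols,
  function(pattern) grepl(pattern, gene_name),
  logical(1),
  USE.NAMES = FALSE
)

# Direction of the suggested call: Gene.Name is the pattern and each
# Gene.Symbols value is the text being searched.
name_as_pattern <- grepl(gene_name, gene_symbols)

print(symbols_as_pattern)  # only the bare "EHBP1" entry matches
print(name_as_pattern)     # every entry that contains "EHBP1" matches
```

Note that a plain pattern like EHBP1 would also match a longer symbol that merely contains it. If that matters for your data, one option is to wrap the names in word boundaries first, for example paste0("\\b", Gene.Name, "\\b"), before joining.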

moodymudskipper, Oct 25 '21 08:10