
URGENT: How to use Fuzzymatcher for one dataset

Open jennguo9 opened this issue 4 years ago • 2 comments

Hi all, Is there a way to identify duplicates in one file? The manual talks about comparing two data files, however, I only need to look for duplicates in one single file. I hope someone has a solution. Thanks in advance!

jennguo9 avatar Jun 10 '20 20:06 jennguo9

Hi @jennguo9

I hacked together a solution for my dataset. It joins the dataframe to a copy of itself and excludes any links where the unique identifier matches. In my case, it was for football players: I have one column I need to de-duplicate on ('Name') and an identifier ('player_id').

import pandas as pd
import numpy as np
import os
from fuzzymatcher.data_preprocessor_default import DataPreprocessor
from fuzzymatcher.data_getter_sqlite import DataGetter
from fuzzymatcher.scorer_default import Scorer
from fuzzymatcher.matcher import Matcher

# load a dataframe

df_all_shots = pd.read_parquet(os.path.join('..', 'data', 'shots.parquet'))

# split into a left and right copy
df_left = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()
df_right = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()

# create a match object
dp = DataPreprocessor()
dg = DataGetter()
s = Scorer()
m = Matcher(dp, dg, s)

# match the data with a copy of itself
m.add_data(df_left, df_right, left_on='Name', right_on='Name', left_id_col='player_id', right_id_col='player_id')
m.match_all()
df_duplicated = m.link_table

# subset the data where the ids don't match - these are the duplicates
df_duplicated = df_duplicated[df_duplicated['__id_left'] != df_duplicated['__id_right']].copy()

# some duplicates will be matched twice (once left / once right), so keep one copy.
# The id is numeric, so create columns with the min/max of the two ids and
# de-duplicate based on these new id columns
df_duplicated['id1'] = df_duplicated[['__id_left', '__id_right']].min(axis=1)
df_duplicated['id2'] = df_duplicated[['__id_left', '__id_right']].max(axis=1)
df_duplicated.drop(['__id_left', '__id_right', '__rank'], axis=1, inplace=True)
df_duplicated.drop_duplicates(inplace=True)

# merge back on the columns you want to keep
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id1', right_on='player_id')
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id2', right_on='player_id', suffixes=['_1', '_2'])
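(The min/max step above just canonicalises each unordered pair of ids, so the A→B and B→A rows collapse to one. A minimal pandas sketch with made-up ids, separate from the fuzzymatcher output, to show the idea:)

```python
import pandas as pd

# Toy link table: each duplicate pair appears twice (A-B and B-A),
# mimicking the symmetric result of matching a dataframe to itself.
links = pd.DataFrame({'__id_left':  [1, 2, 3, 4],
                      '__id_right': [2, 1, 4, 3]})

# Canonicalise each pair: (2, 1) becomes (1, 2), so both orderings
# collapse to the same row when de-duplicating.
links['id1'] = links[['__id_left', '__id_right']].min(axis=1)
links['id2'] = links[['__id_left', '__id_right']].max(axis=1)
unique_pairs = links[['id1', 'id2']].drop_duplicates()

print(unique_pairs)  # two rows: (1, 2) and (3, 4)
```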

Here's my result: [image]

andrewRowlinson avatar Jun 21 '20 10:06 andrewRowlinson

Thanks Andrew! I'm guessing the scores in this package don't mean much, since some of the matches are exactly the same. Also, my file has very noisy and messy data; there could be instances where a record has 4+ possible duplicates. I was hoping this package would be able to cluster the duplicates instead of giving me one-to-one match records. I'm not sure this package will work for me. Thanks for taking the time to respond!
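For anyone landing here with the same clustering need: one workaround (not part of fuzzymatcher — `cluster_duplicates` and the threshold below are made up for illustration) is to treat each fuzzy match as an edge and take connected components with a union-find, so a record with 4+ near-duplicates lands in a single group. A stdlib-only sketch using difflib for the similarity score:

```python
from difflib import SequenceMatcher
from itertools import combinations

def cluster_duplicates(names, threshold=0.85):
    """Group names whose pairwise similarity exceeds threshold,
    treating matches as edges and returning connected components."""
    # Union-find over record indices.
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Compare every pair; fine for small data, O(n^2) in general.
    for i, j in combinations(range(len(names)), 2):
        if SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio() >= threshold:
            union(i, j)

    # Collect each component; keep only groups with actual duplicates.
    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return [group for group in clusters.values() if len(group) > 1]

players = ['Lionel Messi', 'Leonel Messi', 'Harry Kane', 'Hary Kane']
print(cluster_duplicates(players))
# -> [['Lionel Messi', 'Leonel Messi'], ['Harry Kane', 'Hary Kane']]
```

The same idea works on fuzzymatcher's link table: feed the id pairs that pass your score cut-off into the union step instead of re-scoring with difflib.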


jennguo9 avatar Jun 22 '20 15:06 jennguo9