Window indexing algorithm

Open kevohagan opened this issue 7 years ago • 3 comments

I'm kinda stuck trying to write a custom blocking function:

I would like to index on a date interval, let's say df_a['start'] + 2 days >= df_b['start'].

I just can't figure out how to implement a function that returns a MultiIndex like this. Any clues?

Thank you so much for such a great toolkit :) !

kevohagan avatar Jan 15 '18 15:01 kevohagan

@J535D165 would you maybe have an idea how to do this? :/ thanks!

kevohagan avatar Jan 24 '18 23:01 kevohagan

hello @kevohagan

You are looking for an Adaptive Sorted Neighbourhood Indexing method. This is not implemented, but in your case, you can easily get very similar results with the Sorted Neighbourhood Indexing method.

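import pandas as pd
import recordlinkage

# Convert the start date to a whole number of days since the Unix epoch.
# df_b is shifted back by one day so that the symmetric window (+/- 1) covers
# df_a['start'] <= df_b['start'] <= df_a['start'] + 2 days.
df_a['start_unix'] = (df_a['start'] - pd.Timestamp(1970, 1, 1)).dt.days
df_b['start_unix'] = (df_b['start'] - pd.Timestamp(1970, 1, 1)).dt.days - 1

# SNI indexer (newer recordlinkage releases expose this as recordlinkage.index.SortedNeighbourhood)
indexer = recordlinkage.SortedNeighbourhoodIndex(left_on='start_unix', right_on='start_unix', window=3)
pairs = indexer.index(df_a, df_b)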

Or do your own merge (check the source code of BlockIndex and SNI for details).
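For the "do your own merge" route, here is a minimal sketch in plain pandas/NumPy. It is not part of recordlinkage: the helper name `window_pairs` and its parameters `col` and `max_days` are made up for illustration. It assumes the interval you want is df_a['start'] <= df_b['start'] <= df_a['start'] + 2 days, that both columns are datetime64, and that there are no missing dates.

```python
import numpy as np
import pandas as pd


def window_pairs(df_a, df_b, col="start", max_days=2):
    """Hypothetical helper: return a MultiIndex of (df_a label, df_b label)
    pairs where df_a[col] <= df_b[col] <= df_a[col] + max_days days."""
    # Sort df_b by date once, so every window is a contiguous slice of it.
    b_sorted = df_b[col].sort_values()
    b_values = b_sorted.to_numpy()

    a_values = df_a[col].to_numpy()
    upper = a_values + np.timedelta64(max_days, "D")

    # For each df_a row: first df_b position inside the window,
    # and first df_b position past the end of the window.
    lo = np.searchsorted(b_values, a_values, side="left")
    hi = np.searchsorted(b_values, upper, side="right")

    # Repeat each df_a label once per matching df_b row.
    a_labels = np.repeat(df_a.index.to_numpy(), hi - lo)

    # Collect the matching df_b positions; the trailing empty array keeps
    # np.concatenate happy when df_a has no rows at all.
    b_positions = np.concatenate(
        [np.arange(start, stop) for start, stop in zip(lo, hi)]
        + [np.empty(0, dtype=np.intp)]
    )
    b_labels = b_sorted.index.to_numpy()[b_positions]

    return pd.MultiIndex.from_arrays([a_labels, b_labels])
```

The returned MultiIndex has the same shape as what the indexers produce (first level from df_a, second level from df_b), so it should be usable as the candidate pairs for `recordlinkage.Compare().compute(pairs, df_a, df_b)`. Unlike the sorted neighbourhood approach, this windows on the actual date values and not on the rank of the sorted keys.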

Hope it helps. I will take a look at how we can support an algorithm like this.

J535D165 avatar Jan 25 '18 21:01 J535D165

Okay thank you for the hint! I'll have a try and let you know :)

kevohagan avatar Jan 26 '18 11:01 kevohagan