recordlinkage
Window indexing algorithm
I'm stuck trying to write a custom blocking function.
I would like to index on a date interval, for example pairing records where df_a['start'] + 2 days >= df_b['start'].
I just can't figure out how to implement a function that returns a MultiIndex like this. Any clues?
Thank you so much for such a great toolkit :) !
@J535D165 would you maybe have an idea how to do this? :/ Thanks!
hello @kevohagan
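You are looking for an Adaptive Sorted Neighbourhood Indexing method. This is not implemented, but in your case you can easily get very similar results with the Sorted Neighbourhood Indexing method. Note that the window is measured in positions within the sorted key values, not in days, so the result is only approximate when some dates are missing from the data.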
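import pandas as pd
import recordlinkage

# Convert the start date (a datetime64 column) to a day count since the Unix epoch.
# df_b is shifted by one day so a window of 3 roughly covers
# df_a['start'] <= df_b['start'] <= df_a['start'] + 2 days.
epoch = pd.Timestamp(1970, 1, 1)
df_a['start_unix'] = (df_a['start'] - epoch).dt.days
df_b['start_unix'] = (df_b['start'] - epoch).dt.days - 1
# SNI indexer (newer releases expose it as recordlinkage.index.SortedNeighbourhood)
indexer = recordlinkage.SortedNeighbourhoodIndex(left_on='start_unix', right_on='start_unix', window=3)
candidate_links = indexer.index(df_a, df_b)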
Or write your own merge; check the source code of BlockIndex and SNI for details.
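For the "own merge" route, here is a minimal sketch in plain pandas and NumPy. It is not how recordlinkage does it internally. The function name `window_index` and its parameters are hypothetical. It assumes the `start` column is datetime64 with no missing values, and that the wanted condition is df_a['start'] <= df_b['start'] <= df_a['start'] + 2 days.

```python
import numpy as np
import pandas as pd


def window_index(df_a, df_b, on="start", max_days=2):
    """Candidate pairs (a, b) with a[on] <= b[on] <= a[on] + max_days.

    Hypothetical helper, not part of recordlinkage.
    """
    # Sort df_b by date so every record of df_a maps to one contiguous slice.
    b_sorted = df_b.sort_values(on)
    b_dates = b_sorted[on].to_numpy()
    a_dates = df_a[on].to_numpy()

    # lo: first position with b >= a; hi: first position with b > a + max_days.
    lo = np.searchsorted(b_dates, a_dates, side="left")
    hi = np.searchsorted(b_dates, a_dates + np.timedelta64(max_days, "D"), side="right")

    counts = hi - lo
    a_pos = np.repeat(np.arange(len(df_a)), counts)
    if counts.sum() > 0:
        b_pos = np.concatenate([np.arange(l, h) for l, h in zip(lo, hi)])
    else:
        b_pos = np.array([], dtype=int)

    # Same shape as the candidate links returned by an indexer: (index of a, index of b).
    return pd.MultiIndex.from_arrays([df_a.index[a_pos], b_sorted.index[b_pos]])


df_a = pd.DataFrame({"start": ["2017-01-01", "2017-01-10"]}, index=["a0", "a1"])
df_b = pd.DataFrame(
    {"start": ["2017-01-03", "2017-01-04", "2016-12-31", "2017-01-10"]},
    index=["b0", "b1", "b2", "b3"],
)
df_a["start"] = pd.to_datetime(df_a["start"])
df_b["start"] = pd.to_datetime(df_b["start"])

pairs = window_index(df_a, df_b, on="start", max_days=2)
print(list(pairs))
```

Unlike the sorted neighbourhood approach, this compares actual day differences, so gaps in the dates do not widen the window. The resulting MultiIndex should be usable in the comparison step like any other set of candidate links.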
Hope it helps. I will take a look at how we can support an algorithm like this.
Okay thank you for the hint! I'll have a try and let you know :)