recordlinkage
recordlinkage copied to clipboard
Max weighted matching
This pull request implements max weighted bipartite graph matching as a method for the OneToOne
class. It uses networkx
for the actual matching algorithm. Multiple helper functions are defined to transform the MultiIndex
output from recordlinkage
to a graph
object and back.
You can easily visualize (small) graphs with the following code
def draw_weighted_bipartite_graph(graph):
"""Draw a weighted bipartite graph.
No garantuees are made about the order of the nodes."""
labels = nx.get_edge_attributes(graph, "weight")
nodes = [n for n, d in graph.nodes(data=True) if d["bipartite"] == 0]
pos = nx.drawing.layout.bipartite_layout(graph, nodes)
nx.draw_networkx_edge_labels(graph, pos, labels)
nx.draw_networkx(graph, pos)
though I'm not sure recordlinkage
should contain such a function?
Codecov Report
Merging #107 into master will increase coverage by
0.54%
. The diff coverage is95.71%
.
@@ Coverage Diff @@
## master #107 +/- ##
==========================================
+ Coverage 77.2% 77.75% +0.54%
==========================================
Files 33 33
Lines 2338 2400 +62
Branches 376 391 +15
==========================================
+ Hits 1805 1866 +61
+ Misses 408 407 -1
- Partials 125 127 +2
Impacted Files | Coverage Δ | |
---|---|---|
recordlinkage/network.py | 91.96% <95.71%> (+9.27%) |
:arrow_up: |
recordlinkage/base.py | 80.31% <0%> (-0.17%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 87a5f4a...470985f. Read the comment docs.
If I find the time, I'm interesting in adding the symmetric best matching and stable matching methods from Franke et. al as well. These methods seem to produce comparable results, for a much lower computational cost.
Thanks @jpweytjens
You can easily visualize (small) graphs with the following code
def draw_weighted_bipartite_graph(graph): """Draw a weighted bipartite graph. No garantuees are made about the order of the nodes.""" labels = nx.get_edge_attributes(graph, "weight") nodes = [n for n, d in graph.nodes(data=True) if d["bipartite"] == 0] pos = nx.drawing.layout.bipartite_layout(graph, nodes) nx.draw_networkx_edge_labels(graph, pos, labels) nx.draw_networkx(graph, pos)
though I'm not sure
recordlinkage
should contain such a function?
That would be nice to have in the documentation or on the examples page. If it turns out to be a succes, we add it to the toolkit.
Documentation: https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html?highlight=onetoone#network
I've added a new commit that should address all the requested changes.
- A new function
add_weights
is introduced to add weights to candidate matches with 3 different methods. This function returns a Series instead of DataFrame. - Series are now used throughout the
max_weighted
matching function, instead of DataFrames. - networkx has been removed from tox and added to
setup.py
- Documentation has been added for the
max_weighted
method and an example is given for the newadd_weights
function.
I'm curious to hear your thoughts on this new version. If you're happy with them, I would like to add a test for the max_weighted
method.
Can you review the changes @J535D165 ?