recordlinkage icon indicating copy to clipboard operation
recordlinkage copied to clipboard

Max weighted matching

Open jpweytjens opened this issue 5 years ago • 6 comments

This pull request implements max weighted bipartite graph matching as a method for the OneToOne class. It uses networkx for the actual matching algorithm. Multiple helper functions are defined to transform the MultiIndex output from recordlinkage to a graph object and back.

jpweytjens avatar Jul 08 '19 12:07 jpweytjens

You can easily visualize (small) graphs with the following code

def draw_weighted_bipartite_graph(graph):
    """Draw a weighted bipartite graph. 
    No garantuees are made about the order of the nodes."""

    labels = nx.get_edge_attributes(graph, "weight")
    nodes = [n for n, d in graph.nodes(data=True) if d["bipartite"] == 0]
    pos = nx.drawing.layout.bipartite_layout(graph, nodes)

    nx.draw_networkx_edge_labels(graph, pos, labels)
    nx.draw_networkx(graph, pos)

though I'm not sure recordlinkage should contain such a function?

jpweytjens avatar Jul 08 '19 12:07 jpweytjens

Codecov Report

Merging #107 into master will increase coverage by 0.54%. The diff coverage is 95.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage    77.2%   77.75%   +0.54%     
==========================================
  Files          33       33              
  Lines        2338     2400      +62     
  Branches      376      391      +15     
==========================================
+ Hits         1805     1866      +61     
+ Misses        408      407       -1     
- Partials      125      127       +2
Impacted Files Coverage Δ
recordlinkage/network.py 91.96% <95.71%> (+9.27%) :arrow_up:
recordlinkage/base.py 80.31% <0%> (-0.17%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 87a5f4a...470985f. Read the comment docs.

codecov[bot] avatar Jul 08 '19 12:07 codecov[bot]

If I find the time, I'm interesting in adding the symmetric best matching and stable matching methods from Franke et. al as well. These methods seem to produce comparable results, for a much lower computational cost.

jpweytjens avatar Jul 08 '19 12:07 jpweytjens

Thanks @jpweytjens

You can easily visualize (small) graphs with the following code

def draw_weighted_bipartite_graph(graph):
    """Draw a weighted bipartite graph. 
    No garantuees are made about the order of the nodes."""

    labels = nx.get_edge_attributes(graph, "weight")
    nodes = [n for n, d in graph.nodes(data=True) if d["bipartite"] == 0]
    pos = nx.drawing.layout.bipartite_layout(graph, nodes)

    nx.draw_networkx_edge_labels(graph, pos, labels)
    nx.draw_networkx(graph, pos)

though I'm not sure recordlinkage should contain such a function?

That would be nice to have in the documentation or on the examples page. If it turns out to be a succes, we add it to the toolkit.

Documentation: https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html?highlight=onetoone#network

J535D165 avatar Jul 12 '19 14:07 J535D165

I've added a new commit that should address all the requested changes.

  • A new function add_weights is introduced to add weights to candidate matches with 3 different methods. This function returns a Series instead of DataFrame.
  • Series are now used throughout the max_weighted matching function, instead of DataFrames.
  • networkx has been removed from tox and added to setup.py
  • Documentation has been added for the max_weighted method and an example is given for the new add_weights function.

I'm curious to hear your thoughts on this new version. If you're happy with them, I would like to add a test for the max_weighted method.

jpweytjens avatar Nov 27 '19 10:11 jpweytjens

Can you review the changes @J535D165 ?

jpweytjens avatar Jan 17 '24 23:01 jpweytjens