
C202304 - Deduplication search performance improvements

Open rlichainfotel opened this issue 1 year ago • 6 comments

The existing matching algorithm applies weights to different variables to compute a matching score between records that may be duplicates without being exactly identical. For example, name variables are compared using Soundex.
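To make the idea concrete, here is a minimal sketch of a weighted score with a Soundex comparison on name variables. It is not the CanReg5 implementation: the class name, the variable names (`surname`, `sex`, `birthDate`), the weights and the 0 to 100 scale are all assumptions chosen for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: names, weights and scale are hypothetical, not CanReg5's.
class WeightedMatchSketch {

    // Standard American Soundex: first letter plus three digits.
    static String soundex(String name) {
        // Digit for each letter A..Z; '0' means the letter is not coded.
        final String codes = "01230120022455012623010202";
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue; // ignore anything but A-Z
            char code = codes.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // keep the first letter as-is
                prev = code;
                continue;
            }
            if (code != '0' && code != prev) out.append(code);
            if (raw != 'H' && raw != 'W') prev = code; // H and W do not separate equal codes
            if (out.length() == 4) break;
        }
        while (out.length() > 0 && out.length() < 4) out.append('0'); // pad short codes
        return out.toString();
    }

    // Percentage of the total weight carried by the variables that agree.
    static double score(Map<String, String> a, Map<String, String> b,
                        Map<String, Double> weights, Set<String> phoneticVars) {
        double total = 0, matched = 0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            String var = e.getKey();
            total += e.getValue();
            String x = a.get(var), y = b.get(var);
            if (x == null || y == null) continue; // a missing value counts as no match
            boolean same = phoneticVars.contains(var)
                    ? soundex(x).equals(soundex(y))
                    : x.equals(y);
            if (same) matched += e.getValue();
        }
        return total == 0 ? 0 : 100.0 * matched / total;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("surname", "Robert", "sex", "1", "birthDate", "19700101");
        Map<String, String> b = Map.of("surname", "Rupert", "sex", "1", "birthDate", "19700102");
        Map<String, Double> weights = Map.of("surname", 5.0, "sex", 1.0, "birthDate", 4.0);
        System.out.println(score(a, b, weights, Set.of("surname")));
    }
}
```

In this sketch "Robert" and "Rupert" share a Soundex code, so the surname weight counts as a match even though the spellings differ, while the differing birth dates contribute nothing.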

The objective is to improve the algorithm's performance:

  • First, look for 'obvious' ways to improve the algorithm's runtime performance. Spend no more than a week on this. If nothing is found, implement the blocking functionality described next.
  • Blocking: some of the variables could be fixed as blocking variables, such as sex or birth date. For example, if sex is blocked, the algorithm would retrieve only male records when searching for duplicates of a male record. This functionality could be enabled by a small 'Block' tickbox on the deduplication screen.
  • Do not let the deduplication search block CanReg for too long: either run it in a separate thread or add a timeout.
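The blocking and threading points above could be sketched roughly as follows. This is only an illustration under assumptions: records are plain maps, the class and method names are hypothetical, and the filter runs in memory, whereas in CanReg5 the blocked variables would more likely be pushed into the database query so that fewer records are retrieved in the first place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: hypothetical names, in-memory filtering instead of a database query.
class BlockedSearchSketch {

    // Keep only the candidates that agree with the target on every blocked variable.
    static List<Map<String, String>> block(Map<String, String> target,
                                           List<Map<String, String>> candidates,
                                           Set<String> blockedVars) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> candidate : candidates) {
            boolean agrees = true;
            for (String var : blockedVars) {
                if (!Objects.equals(target.get(var), candidate.get(var))) {
                    agrees = false;
                    break;
                }
            }
            if (agrees) kept.add(candidate);
        }
        return kept;
    }

    // Run the search on a worker thread and give up after timeoutMillis.
    static List<Map<String, String>> searchWithTimeout(Map<String, String> target,
                                                       List<Map<String, String>> candidates,
                                                       Set<String> blockedVars,
                                                       long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<List<Map<String, String>>> future =
                worker.submit(() -> block(target, candidates, blockedVars));
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker if it is still running
            throw new IllegalStateException("Deduplication search timed out", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("Deduplication search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Deduplication search failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Map<String, String> target = Map.of("sex", "1", "surname", "Robert");
        List<Map<String, String>> candidates = List.of(
                Map.of("sex", "1", "surname", "Rupert"),
                Map.of("sex", "2", "surname", "Roberta"),
                Map.of("sex", "1", "surname", "Smith"));
        System.out.println(searchWithTimeout(target, candidates, Set.of("sex"), 1000).size());
    }
}
```

Note that `future.get` still makes the calling thread wait until the result or the timeout. If the deduplication screen is a Swing form, the wait itself should happen off the event dispatch thread (for example in a `SwingWorker`), otherwise the UI stays frozen for the duration of the timeout.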

Take measurements before and after the improvements to demonstrate the progress.
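One simple way to get comparable before/after numbers is a small timing helper like the sketch below. The class and method names are hypothetical, and the task passed in stands for whatever call triggers the deduplication search; for more rigorous numbers a dedicated benchmark harness such as JMH would be a better fit.

```java
import java.util.Arrays;

// Illustrative only: a rough wall-clock timer, not a full benchmark harness.
class TimingSketch {

    // Run the task a few times to warm up the JVM, then return the median
    // elapsed time in nanoseconds over the measured runs.
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs <= 0) throw new IllegalArgumentException("runs must be positive");
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the deduplication search call.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new IllegalStateException();
        };
        System.out.println("median ms: " + medianNanos(task, 3, 11) / 1_000_000.0);
    }
}
```

Running the same helper on the same data set before and after each change (the obvious optimisations, then blocking) gives figures that can be compared directly.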

rlichainfotel, Feb 08 '23 13:02