
add story_title_parts table to replace in-memory work

hroberts opened this issue on Mar 27, 2019 · 0 comments

The current story title deduping system in the topic spider builds a giant in-memory table of all the parts of all story titles in a given topic in a given media source. It breaks each title into parts separated by common delimiters like :, -, and |, and then looks for any title part that is longer than some minimum length and appears as the entire title of at least one story. This lets us match story pairs like 'Democrats See Opening Against Trump' and 'New York Times - Democrats See Opening Against Trump'.
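
For reference, here is a minimal sketch of the in-memory matching described above. The delimiter regex, the minimum-length threshold, and all names are assumptions for illustration, not the actual topic spider code:

```python
import re
from collections import defaultdict

# Illustrative threshold; the real minimum part length is an assumption here.
MIN_PART_LENGTH = 32


def title_parts(title: str) -> list[str]:
    """Split a title on common delimiters (:, -, |) and keep only the long parts."""
    parts = re.split(r'[:|]|\s-\s', title)
    return [p.strip().lower() for p in parts if len(p.strip()) >= MIN_PART_LENGTH]


def find_title_duplicates(stories: list[dict]) -> set[tuple[int, int]]:
    """Return pairs of story ids where one story's whole title matches a part of another's."""
    # the big in-memory map from every long title part to the stories containing it
    part_to_stories = defaultdict(set)
    for story in stories:
        for part in title_parts(story['title']):
            part_to_stories[part].add(story['stories_id'])

    pairs = set()
    for story in stories:
        whole_title = story['title'].strip().lower()
        if len(whole_title) < MIN_PART_LENGTH:
            continue
        for other_id in part_to_stories.get(whole_title, set()):
            if other_id != story['stories_id']:
                pairs.add(tuple(sorted((story['stories_id'], other_id))))
    return pairs
```

With the two example titles above, this returns the pair of story ids because the part after the ' - ' delimiter in the second title equals the entire first title.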

I would like to try replacing this with a simpler system that just stores every title part longer than some minimum length (or maybe just the single longest such part) in a Postgres table and then does a SQL query to find a duplicate story. My intuition after watching the title deduping for a couple of years is that this will have precision very close to that of the in-memory version.
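
A rough sketch of what the table-based version could look like. The column names, index, and threshold are guesses for illustration, not an existing Media Cloud schema, and `db` is assumed to be a DB-API cursor (e.g. psycopg2):

```python
# Sketch of the proposed table-based approach; schema details are assumptions.
MIN_PART_LENGTH = 32  # assumed minimum length for a stored title part

CREATE_TABLE = """
    create table if not exists story_title_parts (
        stories_id bigint not null,
        title_part text not null
    );
    create index if not exists story_title_parts_part
        on story_title_parts (title_part);
"""


def store_title_parts(db, stories_id: int, title: str) -> None:
    """Store each long part of a story title so later stories can be deduped against it."""
    for part in title_parts(title):  # title_parts() from the sketch above
        db.execute(
            "insert into story_title_parts (stories_id, title_part) values (%s, %s)",
            (stories_id, part),
        )


def find_duplicate_story(db, title: str):
    """Return the stories_id of a story whose stored title part equals this whole title, if any."""
    whole_title = title.strip().lower()
    if len(whole_title) < MIN_PART_LENGTH:
        return None
    db.execute(
        "select stories_id from story_title_parts where title_part = %s limit 1",
        (whole_title,),
    )
    row = db.fetchone()
    return row[0] if row else None
```

With an index on title_part, the duplicate lookup becomes a single equality query, which is what would make it cheap enough to reuse outside the topic spider. The "single longest part" variant would only change store_title_parts to insert one row per story.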

The table-based approach has the great advantage that we can use it anywhere in our code base where we do duplicate detection by title matching, including in the crawler, where we currently only match on exact titles (and so end up with more dups than we do in the topic mapper). It will also allow us to do a much better job of deduping stories when we merge stories between media for the domain media task.
