TFIDF content matching should check inter-scrape

Open PaulMcInnis opened this issue 3 years ago • 2 comments

Description

Currently we remove duplicates everywhere, but duplicates by description (TFIDF) are only removed between the masterlist and the scrape data.

We should allow the masterlist to perform a content match against itself.

Steps to Reproduce

  1. scrape some jobs to .pkl
  2. copy-paste a row a few times, only changing the key_id
  3. run again with --no-scrape

Expected behavior

We should be running TFIDF within the scrape data and within the master CSV.

Actual behavior

Only duplicates in the incoming dict are identified, based on the master CSV.

Environment

  • Build: 3.0.0

PaulMcInnis, Sep 12 '20 02:09

Thinking about this... could we simply identify matches in the backwards direction after we perform the matching in the forwards direction? (Currently it's incoming job vs master dict, but we want master job vs master dict.)
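
A rough sketch of what "master job vs master dict" matching could look like. This is not JobFunnel's actual code: it uses a small pure-Python TF-IDF and cosine similarity instead of whatever the project calls internally, and the `find_self_duplicates` name, the `{key_id: {"description": ...}}` dict shape and the 0.75 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed IDF, so a term present in every document keeps a non-zero weight.
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def find_self_duplicates(master, threshold=0.75):
    """Compare every job in `master` against every other job in `master`.

    Returns {duplicate_key_id: original_key_id}; the first-seen job is kept.
    The dict shape and the threshold are hypothetical.
    """
    key_ids = list(master)
    vectors = tfidf_vectors([master[k]["description"] for k in key_ids])
    duplicates = {}
    for i, key_i in enumerate(key_ids):
        if key_i in duplicates:
            continue  # already flagged as a copy of an earlier job
        for j in range(i + 1, len(key_ids)):
            key_j = key_ids[j]
            if key_j in duplicates:
                continue
            if cosine(vectors[i], vectors[j]) >= threshold:
                duplicates[key_j] = key_i
    return duplicates


# Mirrors the repro steps: the same row copied with only the key_id changed.
master = {
    "a1": {"description": "Senior Python developer building data pipelines"},
    "a2": {"description": "Senior Python developer building data pipelines"},
    "b1": {"description": "Registered nurse for night shifts at a city hospital"},
}
duplicates = find_self_duplicates(master)
```

Only the upper triangle of the similarity matrix is visited, so each pair is checked once and the earlier key_id is always the one that survives. The same function could in principle be run on the incoming scrape dict before it is compared against the master.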

PaulMcInnis, Sep 12 '20 19:09

Accidentally tapped close...

PaulMcInnis, Sep 12 '20 19:09