Ensure same documents are not pushed more than once.

Open bidoubiwa opened this issue 1 year ago • 1 comments

Context

Some websites have multiple URLs pointing to the same page. For example, in the OpenAI documentation:

  • https://platform.openai.com/docs/plugins/getting-started
  • https://platform.openai.com/docs/plugins/getting-started/plugin-manifest
  • https://platform.openai.com/docs/plugins/getting-started/running-a-plugin

Problem

Since the crawler cannot tell that it has already scraped those pages, it scrapes them again. As a result, the same documents end up indexed multiple times.

The current workaround is to add distinctAttribute: "content" to the meilisearch settings of your scrapix configuration.
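As a rough illustration of that workaround, the configuration could look like the sketch below. The key names start_urls and meilisearch_settings, and the start URL itself, are assumptions made for the example; only distinctAttribute: "content" comes from this issue, so check the scrapix README for the exact shape.

```typescript
// Hypothetical scrapix configuration object. Only the distinctAttribute
// value is taken from the issue; the other keys are assumed for illustration.
const scrapixConfig = {
  start_urls: ["https://platform.openai.com/docs"], // assumed start URL
  meilisearch_settings: {
    // Meilisearch returns at most one document per distinct value of this
    // attribute, so sections with identical content collapse in results.
    distinctAttribute: "content",
  },
};
```

Note that this hides duplicates at search time only; the duplicated documents are still pushed and stored in the index.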

Solution

The long-term solution would be to add a new field to each document sent to Meilisearch, containing a hash of the document's relevant fields. For example, a section_hash field would hold a hash of all of the following fields:

  • hierarchy_lvl0
  • hierarchy_lvl1
  • hierarchy_lvl2
  • hierarchy_lvl3
  • hierarchy_lvl4
  • hierarchy_lvl5
  • hierarchy_radio_lvl0
  • hierarchy_radio_lvl1
  • hierarchy_radio_lvl2
  • hierarchy_radio_lvl3
  • hierarchy_radio_lvl4
  • hierarchy_radio_lvl5
  • content
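A minimal sketch of how such a hash could be computed, assuming Node's built-in crypto module and SHA-256. The function name computeSectionHash and the choice of hash algorithm are assumptions for illustration, not something scrapix already provides.

```typescript
import { createHash } from "node:crypto";

// Fields listed above, kept in a fixed order so the hash is stable.
const HASHED_FIELDS = [
  "hierarchy_lvl0", "hierarchy_lvl1", "hierarchy_lvl2",
  "hierarchy_lvl3", "hierarchy_lvl4", "hierarchy_lvl5",
  "hierarchy_radio_lvl0", "hierarchy_radio_lvl1", "hierarchy_radio_lvl2",
  "hierarchy_radio_lvl3", "hierarchy_radio_lvl4", "hierarchy_radio_lvl5",
  "content",
] as const;

// Hypothetical helper: returns a hex SHA-256 digest of the relevant fields.
// Fields such as the URL are deliberately ignored, so the same section
// reached through different URLs produces the same hash.
function computeSectionHash(doc: Record<string, unknown>): string {
  const values = HASHED_FIELDS.map((field) => {
    const value = doc[field];
    // Missing or non-string fields are hashed as null.
    return typeof value === "string" ? value : null;
  });
  // Serializing as a JSON array keeps field boundaries unambiguous,
  // e.g. ["ab", "c"] and ["a", "bc"] do not collide.
  return createHash("sha256").update(JSON.stringify(values)).digest("hex");
}
```

Two documents scraped from different URLs but with identical hierarchy and content would then share one section_hash, which is what lets a distinct attribute collapse them.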

We then set section_hash as the default distinctAttribute, for example here: https://github.com/meilisearch/scrapix/blob/070c9074b8b313de8714575da7941054c7100ce5/src/scrapers/docssearch.ts#L13

The same default should also apply to the default strategy.
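One possible shape for that default, sketched under the assumption that each strategy merges its own default settings with the user-supplied ones. DEFAULT_SETTINGS and resolveSettings are hypothetical names, not existing scrapix code.

```typescript
// Hypothetical settings type; only distinctAttribute matters here.
type IndexSettings = { distinctAttribute?: string; [key: string]: unknown };

// Assumed defaults shared by the docssearch and default strategies.
const DEFAULT_SETTINGS: IndexSettings = {
  distinctAttribute: "section_hash",
};

// User-provided settings win over the defaults, so an explicit
// distinctAttribute in the scrapix configuration is still respected.
function resolveSettings(userSettings: IndexSettings = {}): IndexSettings {
  return { ...DEFAULT_SETTINGS, ...userSettings };
}
```

With this approach, users who already set distinctAttribute: "content" keep their behavior, while everyone else gets deduplication by section_hash without extra configuration.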

bidoubiwa, Jun 29 '23 16:06