scrapix
Ensure the same documents are not pushed more than once.
## Context
Some websites have multiple URLs pointing to the same page. For example, on OpenAI's documentation site:
- https://platform.openai.com/docs/plugins/getting-started
- https://platform.openai.com/docs/plugins/getting-started/plugin-manifest
- https://platform.openai.com/docs/plugins/getting-started/running-a-plugin
## Problem
Since the crawler has no way of knowing it has already scraped those pages, it scrapes them again. This results in the same documents being pushed multiple times.
The current workaround is to add `distinctAttribute: "content"` to the Meilisearch settings of your Scrapix configuration.
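For reference, a minimal sketch of what that could look like in a Scrapix configuration (the key names, in particular `meilisearch_settings`, are assumed from the usual config shape rather than taken from the source, so check them against your own config file):

```ts
// Hypothetical Scrapix configuration object; key names are assumptions,
// not verified against the repository.
const config = {
  start_urls: ['https://platform.openai.com/docs'],
  meilisearch_url: 'http://localhost:7700',
  meilisearch_api_key: 'masterKey',
  meilisearch_index_uid: 'openai-docs',
  meilisearch_settings: {
    // Meilisearch keeps only one document per distinct "content" value,
    // so exact duplicates collapse into a single search result.
    distinctAttribute: 'content',
  },
}
```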
## Solution
The long-term solution would be to create a new field in Meilisearch containing a hash of each document's relevant fields. For example, a `section_hash` field would store a hash of all of the following fields (a sketch of the computation follows the list):
- hierarchy_lvl0
- hierarchy_lvl1
- hierarchy_lvl2
- hierarchy_lvl3
- hierarchy_lvl4
- hierarchy_lvl5
- hierarchy_radio_lvl0
- hierarchy_radio_lvl1
- hierarchy_radio_lvl2
- hierarchy_radio_lvl3
- hierarchy_radio_lvl4
- hierarchy_radio_lvl5
- content
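A minimal sketch of how that hash could be computed, assuming a Node.js environment and the docsearch-style document shape above (the `DocSection` type and `computeSectionHash` helper are hypothetical names, not part of the codebase):

```ts
import { createHash } from 'node:crypto'

// Hypothetical document shape, matching the fields listed above.
type DocSection = {
  hierarchy_lvl0?: string
  hierarchy_lvl1?: string
  hierarchy_lvl2?: string
  hierarchy_lvl3?: string
  hierarchy_lvl4?: string
  hierarchy_lvl5?: string
  hierarchy_radio_lvl0?: string
  hierarchy_radio_lvl1?: string
  hierarchy_radio_lvl2?: string
  hierarchy_radio_lvl3?: string
  hierarchy_radio_lvl4?: string
  hierarchy_radio_lvl5?: string
  content?: string
}

const HASHED_FIELDS: (keyof DocSection)[] = [
  'hierarchy_lvl0', 'hierarchy_lvl1', 'hierarchy_lvl2',
  'hierarchy_lvl3', 'hierarchy_lvl4', 'hierarchy_lvl5',
  'hierarchy_radio_lvl0', 'hierarchy_radio_lvl1', 'hierarchy_radio_lvl2',
  'hierarchy_radio_lvl3', 'hierarchy_radio_lvl4', 'hierarchy_radio_lvl5',
  'content',
]

// Two sections with identical hierarchy and content produce the same
// hash regardless of which URL they were scraped from. A NUL byte
// separates fields so adjacent values cannot blur into each other.
function computeSectionHash(doc: DocSection): string {
  const input = HASHED_FIELDS.map((f) => doc[f] ?? '').join('\u0000')
  return createHash('sha256').update(input).digest('hex')
}

// Attached before pushing to Meilisearch:
// document.section_hash = computeSectionHash(document)
```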
We would then set `section_hash` as the default `distinctAttribute`, for example here:
https://github.com/meilisearch/scrapix/blob/070c9074b8b313de8714575da7941054c7100ce5/src/scrapers/docssearch.ts#L13
but also in the default strategy.
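Concretely, that could mean shipping `section_hash` in the scraper's default settings object, roughly like this (a sketch only, not the actual contents of docssearch.ts; the setting names follow the Meilisearch settings API):

```ts
// Sketch of a default Meilisearch settings object for the docssearch
// strategy, with deduplication on the hash rather than on raw content.
const defaultSettings = {
  distinctAttribute: 'section_hash',
  searchableAttributes: [
    'hierarchy_lvl0',
    'hierarchy_lvl1',
    'hierarchy_lvl2',
    'hierarchy_lvl3',
    'hierarchy_lvl4',
    'hierarchy_lvl5',
    'content',
  ],
}
```

Deduplicating on `section_hash` instead of `content` also catches the case where two sections share the same body text but sit under different headings, since the hierarchy fields are part of the hash.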