scrapix
Ensure the same documents are not pushed more than once.
## Context
Some websites have multiple URLs pointing to the same page. For example, on OpenAI's documentation site:
- https://platform.openai.com/docs/plugins/getting-started
- https://platform.openai.com/docs/plugins/getting-started/plugin-manifest
- https://platform.openai.com/docs/plugins/getting-started/running-a-plugin
## Problem
Since the crawler has no way of knowing it has already scraped those pages, it scrapes them again. This results in the same documents being pushed multiple times.
The current workaround is to add `distinctAttribute: "content"` to the Meilisearch settings of your Scrapix configuration.
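For reference, a minimal sketch of what that could look like in a Scrapix configuration (the key names, in particular `meilisearch_settings`, are assumed from the usual config shape rather than taken from the source, so check them against your own config file):

```ts
// Hypothetical Scrapix configuration object; key names are assumptions,
// not verified against the repository.
const config = {
  start_urls: ['https://platform.openai.com/docs'],
  meilisearch_url: 'http://localhost:7700',
  meilisearch_api_key: 'masterKey',
  meilisearch_index_uid: 'openai-docs',
  meilisearch_settings: {
    // Meilisearch keeps only one document per distinct "content" value,
    // so exact duplicates collapse into a single search result.
    distinctAttribute: 'content',
  },
}
```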
## Solution
The long-term solution would be to create a new field in Meilisearch containing a hash of each document's relevant fields. For example, a `section_hash` field would store a hash of all of the following fields (a sketch of the computation follows the list):
- hierarchy_lvl0
- hierarchy_lvl1
- hierarchy_lvl2
- hierarchy_lvl3
- hierarchy_lvl4
- hierarchy_lvl5
- hierarchy_radio_lvl0
- hierarchy_radio_lvl1
- hierarchy_radio_lvl2
- hierarchy_radio_lvl3
- hierarchy_radio_lvl4
- hierarchy_radio_lvl5
- content
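A minimal sketch of how that hash could be computed, assuming a Node.js environment and the docsearch-style document shape above (the `DocSection` type and `computeSectionHash` helper are hypothetical names, not part of the codebase):

```ts
import { createHash } from 'node:crypto'

// Hypothetical document shape, matching the fields listed above.
type DocSection = {
  hierarchy_lvl0?: string
  hierarchy_lvl1?: string
  hierarchy_lvl2?: string
  hierarchy_lvl3?: string
  hierarchy_lvl4?: string
  hierarchy_lvl5?: string
  hierarchy_radio_lvl0?: string
  hierarchy_radio_lvl1?: string
  hierarchy_radio_lvl2?: string
  hierarchy_radio_lvl3?: string
  hierarchy_radio_lvl4?: string
  hierarchy_radio_lvl5?: string
  content?: string
}

const HASHED_FIELDS: (keyof DocSection)[] = [
  'hierarchy_lvl0', 'hierarchy_lvl1', 'hierarchy_lvl2',
  'hierarchy_lvl3', 'hierarchy_lvl4', 'hierarchy_lvl5',
  'hierarchy_radio_lvl0', 'hierarchy_radio_lvl1', 'hierarchy_radio_lvl2',
  'hierarchy_radio_lvl3', 'hierarchy_radio_lvl4', 'hierarchy_radio_lvl5',
  'content',
]

// Two sections with identical hierarchy and content produce the same
// hash regardless of which URL they were scraped from. A NUL byte
// separates fields so adjacent values cannot blur into each other.
function computeSectionHash(doc: DocSection): string {
  const input = HASHED_FIELDS.map((f) => doc[f] ?? '').join('\u0000')
  return createHash('sha256').update(input).digest('hex')
}

// Attached before pushing to Meilisearch:
// document.section_hash = computeSectionHash(document)
```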
We would then set `section_hash` as the default `distinctAttribute`, for example here:
https://github.com/meilisearch/scrapix/blob/070c9074b8b313de8714575da7941054c7100ce5/src/scrapers/docssearch.ts#L13
but also in the default strategy.
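Concretely, that could mean shipping `section_hash` in the scraper's default settings object, roughly like this (a sketch only, not the actual contents of docssearch.ts; the setting names follow the Meilisearch settings API):

```ts
// Sketch of a default Meilisearch settings object for the docssearch
// strategy, with deduplication on the hash rather than on raw content.
const defaultSettings = {
  distinctAttribute: 'section_hash',
  searchableAttributes: [
    'hierarchy_lvl0',
    'hierarchy_lvl1',
    'hierarchy_lvl2',
    'hierarchy_lvl3',
    'hierarchy_lvl4',
    'hierarchy_lvl5',
    'content',
  ],
}
```

Deduplicating on `section_hash` instead of `content` also catches the case where two sections share the same body text but sit under different headings, since the hierarchy fields are part of the hash.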