
Doc Scraper removing old index on 2nd run

Open gmourier opened this issue 3 years ago • 4 comments

Initially created by @munim 2 days ago

Dear team,

I am trying out Meilisearch and indexing our site using the docs-scraper project from Meilisearch. It worked at first, but when I ran the scraper again with the same command, it cleared all the items and started from scratch. Here's what I did:

  1. Created a Docker network and started Meilisearch with Docker:

$ docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='123' \
    -v $(pwd)/meili_data:/meili_data \
    --network="meilisearch-test-01" \
    getmeili/meilisearch:v0.28 \
    meilisearch --env="development"

  2. Created a scraper config file as described in the project README (a minimal example of its shape is sketched after this list).

  3. Started the scraper with the following command:

$ docker run -t --rm \
    -e MEILISEARCH_HOST_URL=http://exciting_banach:7700 \
    -e MEILISEARCH_API_KEY=123 \
    --network="meilisearch-test-01" \
    -v `pwd`/test-scraper.config.json:/docs-scraper/config.json \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper config.json

  4. It took around 30 minutes to scrape 50K pages.
  5. I reran the scraper after making some changes to the config.
  6. Now I see that all my previous entries have been removed from Meilisearch and new entries are being added.
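
For reference, a minimal config of the shape the README describes might look like this (the index_uid, URL, and selectors here are placeholders, not my actual test-scraper.config.json):

{
  "index_uid": "docs",
  "start_urls": ["https://www.example.com/docs/"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}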

My question is: how can I update the entries rather than removing the old ones and recreating everything from scratch?

gmourier avatar Aug 09 '22 10:08 gmourier

Hello there,

As far as I understand, the described behaviour is caused by this line.

Though removing old entries is the right option in most cases, @munim can override this behaviour by changing the source code.
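
For illustration, the kind of override meant here might look like the following (a rough sketch with the meilisearch Python client, not the actual docs-scraper code; the host, key, index uid, and document are placeholders):

import meilisearch

client = meilisearch.Client("http://localhost:7700", "123")
index = client.index("docs")

documents = [
    {"objectID": "abc123", "url": "https://example.com/docs/", "content": "..."}
]

# Instead of deleting the index and starting from scratch on every run,
# just push the documents: add_documents upserts by primary key, so a
# document whose objectID already exists is replaced in place.
index.add_documents(documents, primary_key="objectID")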

Hope you're doing great, Matthew

mdraevich avatar Sep 02 '22 10:09 mdraevich

Hey @gmourier, sorry for the delay!

Unfortunately, we are not able to know, between two scrapes, which documents were updated. The IDs are generated randomly:

https://github.com/meilisearch/docs-scraper/blob/5b452af1b0b79be88c4449f6108fa50a28475a1a/scraper/src/strategies/default_strategy.py#L196-L202

If we do not remove the old documents, you are going to have your entries duplicated.

bidoubiwa avatar Sep 05 '22 15:09 bidoubiwa

If we do not remove the old documents, you are going to have your entries duplicated.

Hello, @bidoubiwa

I have a slightly different view on the question... The digest hash for the same input will always be the same (it is immutable). So in our case, if a document has the same hierarchy_to_hash, url, and position, the value of the hash function will be the same on the first launch of docs-scraper and on every subsequent one.

If the URL (and its content) doesn't change, the document objectID remains the same, so Meilisearch doesn't index the document twice. If the URL (or its content) changes, the document objectID changes, so the document is indexed again by Meilisearch.

To finish things up: if existing URLs (and their contents) change quite rarely, it may be reasonable not to delete old entries, since it speeds up the indexing process.
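
To illustrate the idea (a minimal sketch, not the actual docs-scraper hashing code; the field names and digest algorithm are assumptions):

import hashlib
import json

def object_id(hierarchy: dict, url: str, position: int) -> str:
    # The same input always produces the same digest, run after run.
    raw = json.dumps(
        {"hierarchy": hierarchy, "url": url, "position": position},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.md5(raw).hexdigest()

a = object_id({"lvl0": "Guides", "lvl1": "Install"}, "https://example.com/install", 3)
b = object_id({"lvl0": "Guides", "lvl1": "Install"}, "https://example.com/install", 3)
assert a == b  # an unchanged page keeps the same objectID across runs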

What do you think about that?

Best regards, Matthew

mdraevich avatar Sep 05 '22 17:09 mdraevich

Hey @mdraevich,

You are right; I did not pay enough attention to what the script does. I think we initially decided to delete the index in order to spend less time on re-indexation when updating the settings. What happens is that when you add documents, they are indexed in Meilisearch, and if you then add your settings, depending on the settings, Meilisearch has to re-index all the documents. To save time, we decided to delete the index and start the process by adding the settings first and then the documents.
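
To illustrate the ordering (a sketch with the meilisearch Python client; the settings and document are placeholders):

import meilisearch

client = meilisearch.Client("http://localhost:7700", "123")
index = client.index("docs")

# Settings first: documents are then indexed once, with the final
# settings already in place.
index.update_settings({"searchableAttributes": ["hierarchy_lvl0", "content"]})
index.add_documents([{"objectID": "abc123", "content": "..."}])

# Documents first and settings afterwards would force Meilisearch to
# re-index everything once the settings change.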

Another huge issue is that we are not able to know which documents should be removed after an update. Let's say I scrape your website once, then you make changes and I scrape it again. We are not able to remove the old entries that no longer exist on your app.

I think it might be a nice idea to add the no-delete behaviour as an option.

What do you think?

Update 1: in the next release of Meilisearch (v0.29.0), documents that are updated but are exactly the same as their previous version will not be re-indexed again.

bidoubiwa avatar Sep 06 '22 13:09 bidoubiwa

Hey, sorry I'm a bit late.

If the URL (and its content) doesn't change, the document objectID remains the same, so Meilisearch doesn't index the document twice. If the URL (or its content) changes, the document objectID changes, so the document is indexed again by Meilisearch.

Meilisearch is already doing all of that automatically; I don't think you have anything to do on the docs-scraper side.

Personally, what I'm afraid of is that if a URL (or document) is deleted, docs-scraper won't be able to tell, since it doesn't know about the last run, and thus the document won't be deleted from the index, right? The only way not deleting everything could work is to ensure you always call docs-scraper from the same place, and to remember all the IDs generated in the last run so the unused ones can be deleted :thinking:
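
Something like this bookkeeping, for example (a rough sketch, not existing docs-scraper code; the file name, index uid, and IDs are placeholders):

import json
import meilisearch

client = meilisearch.Client("http://localhost:7700", "123")
index = client.index("docs")

new_ids = {"abc123", "def456"}  # IDs produced by the current scrape
try:
    with open("last_run_ids.json") as f:
        old_ids = set(json.load(f))
except FileNotFoundError:
    old_ids = set()  # first run: nothing to compare against

# Delete only the documents that disappeared since the last run.
stale = list(old_ids - new_ids)
if stale:
    index.delete_documents(stale)

# Remember the current IDs for the next run.
with open("last_run_ids.json", "w") as f:
    json.dump(sorted(new_ids), f)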

irevoire avatar May 15 '23 13:05 irevoire

Yes indeed @irevoire this is the issue :/

As an alternative, we could use index swapping to avoid downtime, which seems better than the current behavior.
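
For example (a sketch assuming a Meilisearch version that has the /swap-indexes route, v0.30 or later, and the meilisearch Python client; index names are placeholders):

import meilisearch

client = meilisearch.Client("http://localhost:7700", "123")

# Scrape into a fresh index while "docs" keeps serving queries...
staging = client.index("docs_new")
staging.add_documents([{"objectID": "abc123", "content": "..."}])

# ...then swap the two atomically, so searches never hit an empty index.
client.swap_indexes([{"indexes": ["docs", "docs_new"]}])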

bidoubiwa avatar May 15 '23 16:05 bidoubiwa

This repo is now in low-maintenance mode, and this issue is no longer relevant today. I'm closing all issues that are not bugs.

alallema avatar Sep 06 '23 11:09 alallema