typesense-docsearch-scraper Scraper is unreliable: Some pages are not found at times.

Description

The number of pages and records the scraper finds and processes varies greatly.

Steps to reproduce

Reproduce the issue using the provided repository, https://github.com/juhamust/docusaurus-typesense-search, or see the screenshot.

Screenshot from subsequent runs:

Expected Behavior

The number of records should remain consistent.

Actual Behavior

Metadata

Both the server and the scraper run in Docker containers on macOS. The target is a fairly vanilla Docusaurus website, built and served using Docusaurus.

Typesense Scraper Version: typesense/docsearch-scraper:0.11.0
Typesense Version: typesense/typesense:27.1

Jan 02 '25 15:01 juhamust

Thank you for reporting this issue with detailed reproduction steps. I've identified and fixed the root cause of the inconsistent page scraping behavior.

The problem was in how we handled page loading - we were using a static delay which didn't account for varying page load times and dynamic content. This led to inconsistent results where some content might not be fully loaded before scraping began.

I've submitted #80 to address that:

Replace static delays with dynamic page load detection
Add proper fallback behavior if a page takes too long to load
Ensure all dynamic content is properly loaded before scraping

The fix uses Selenium's WebDriverWait to actively check when the page is fully loaded rather than using a fixed timeout. In testing, this has resulted in consistent page counts between runs.
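For readers who want to see the idea behind "dynamic page load detection", here is a minimal sketch. It is not the code from #80. It spells out, using only the standard library, the kind of polling loop that Selenium's WebDriverWait performs for you. The function name `wait_for_page_ready`, its parameters, and the choice of `document.readyState` as the readiness signal are all assumptions made for illustration. The only Selenium API it relies on is the driver's `execute_script` method.

```python
import time


def wait_for_page_ready(driver, timeout=10.0, poll_interval=0.25):
    """Poll the page until it reports that loading has finished.

    Hypothetical helper, not part of the scraper. `driver` is any object
    with an `execute_script(script)` method, such as a Selenium WebDriver.

    Returns True if document.readyState became "complete" in time,
    and False if the timeout elapsed first (the fallback case).
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            state = driver.execute_script("return document.readyState")
        except Exception:
            # The page may be mid-navigation; treat as "not ready" and retry.
            state = None
        if state == "complete":
            return True
        if time.monotonic() >= deadline:
            # Give up waiting; the caller decides whether to scrape anyway.
            return False
        time.sleep(poll_interval)
```

A caller would invoke this after `driver.get(url)` and before extracting records. If it returns False, the caller can log the URL and either scrape what is there or skip the page. Note that `document.readyState` alone does not guarantee that client-side rendered content has finished hydrating, so a real fix may also need to wait for a specific element to appear.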

Could you please:

Pull the latest changes
Run a few test scrapes against your Docusaurus site
Verify that you're now getting consistent results between runs

On my Arch Linux machine (kernel 6.12.10-arch1-1), using the Chromium package from the Arch repository, this change produced stable results for over 14 consecutive runs.

Let me know if you're still seeing any variance in the number of pages scraped. I'm happy to investigate further if needed.

Feb 03 '25 13:02 tharropoulos

Thanks for the update and the fix! However, for some reason, I'm still seeing the variation in the outcome 😞

The steps were reproduced with the following setup:

macOS 15.3
Typesense: 27.1, running in Docker container
Scraper: docker run -it --env-file=../.env -e "CONFIG=$(cat scraper-config.json | jq -r tostring)" typesense/docsearch-scraper:0.12.0.rc6, running in Docker container
Docusaurus 3.6.x (with build and serve, not dev mode), running in shell
Repository, including the instructions: https://github.com/juhamust/docusaurus-typesense-search

Feb 04 '25 12:02 juhamust

I tested it in your repository. Could you try replicating it in a containerized environment?

Feb 04 '25 13:02 tharropoulos

Sorry for the late reply: no matter how I try to run it, the result is the same for me 🫤 I'm now running the solution using Docker Compose:

container 1: typesense server
container 2: docuserver
container 3: Typesense scraper

I still get the same issue. Or how are you running it?

Mar 02 '25 17:03 juhamust

I'm running the Docscraper in Docker, and Typesense in Docker as well. Docusaurus is running through Node 23 on Arch Linux, kernel 6.12.4.

Mar 02 '25 17:03 tharropoulos

I'm running the Docscraper in Docker, and Typesense in Docker as well. Docusaurus is running through Node 23 on Arch Linux, kernel 6.12.4.

It is very similar to mine, except on Mac. I wonder if I'm still somehow running the non-fixed version of Scraper 🤔 But, when looking at the files, it contains the changes introduced in #80 🤷

Mar 02 '25 18:03 juhamust

To make sure that's the case, maybe try cloning this and running it locally. I'm using the pipenv shell with Python 3.10.16 here, and it works. You're right that it does not work every time with the Docker image.

Mar 03 '25 10:03 tharropoulos