Scraper is unreliable: Some pages are not found at times.
Description
The number of pages and records the scraper finds and processes varies greatly.
Steps to reproduce
Reproduce the issue using the provided repository, https://github.com/juhamust/docusaurus-typesense-search, or see the screenshot.
Screenshot from subsequent runs:
Expected Behavior
The number of records should remain consistent.
Actual Behavior
Metadata
Both the server and the scraper run in Docker containers on macOS. The target is a fairly vanilla Docusaurus website, built and served using Docusaurus.
Typesense Scraper Version: typesense/docsearch-scraper:0.11.0
Typesense Version: typesense/typesense:27.1
Thank you for reporting this issue with detailed reproduction steps. I've identified and fixed the root cause of the inconsistent page scraping behavior.
The problem was in how we handled page loading - we were using a static delay which didn't account for varying page load times and dynamic content. This led to inconsistent results where some content might not be fully loaded before scraping began.
I've submitted #80 to address that:
- Replace static delays with dynamic page load detection
- Add proper fallback behavior if a page takes too long to load
- Ensure all dynamic content is properly loaded before scraping
The fix uses Selenium's WebDriverWait to actively check when the page is fully loaded rather than using a fixed timeout. In testing, this has resulted in consistent page counts between runs.
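To illustrate the difference between a static delay and dynamic load detection, here is a minimal stdlib-only sketch of the poll-until-ready idea that WebDriverWait is built around. It is not the code from #80; the `wait_until` helper and its parameters are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 10.0,
               poll_interval: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, or False on timeout so the
    caller can apply a fallback (for example, scrape whatever has loaded).
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# With Selenium, roughly the same idea is usually written as:
#
#   from selenium.webdriver.support.ui import WebDriverWait
#   WebDriverWait(driver, 10).until(
#       lambda d: d.execute_script("return document.readyState") == "complete"
#   )
#
# which raises TimeoutException instead of returning False on timeout.
```

Unlike a fixed `time.sleep(n)`, this returns as soon as the page reports ready and only spends the full timeout on slow pages.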
Could you please:
- Pull the latest changes
- Run a few test scrapes against your Docusaurus site
- Verify that you're now getting consistent results between runs
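One possible way to check consistency is to record the indexed document count after each scraper run and compare the values. The sketch below assumes the Typesense `GET /collections/{name}` endpoint, which reports `num_documents`; the base URL, collection name, and API key are placeholders to be replaced with your own values, and both helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable


def fetch_num_documents(base_url: str, collection: str, api_key: str) -> int:
    """Read `num_documents` for one collection from a Typesense server."""
    req = urllib.request.Request(
        f"{base_url}/collections/{collection}",
        headers={"X-TYPESENSE-API-KEY": api_key},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.load(resp)["num_documents"])


def summarize_counts(counts: Iterable[int]) -> dict:
    """Summarize document counts collected after several scraper runs."""
    values = list(counts)
    if not values:
        raise ValueError("at least one count is required")
    return {
        "runs": len(values),
        "min": min(values),
        "max": max(values),
        "consistent": min(values) == max(values),
    }
```

For example, call `fetch_num_documents("http://localhost:8108", "my-docs", "xyz")` after each run (all three arguments are assumed values), collect the results in a list, and pass the list to `summarize_counts`.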
On my Arch Linux (Kernel Linux 6.12.10-arch1-1) machine using the chromium arch repository package, this change produced stable results for over 14 consecutive runs.
Let me know if you're still seeing any variance in the number of pages scraped. I'm happy to investigate further if needed.
Thanks for the update and the fix! However, for some reason, I'm still seeing the variation in the outcome 😞
The results were reproduced with the following setup:
- macOS 15.3
- Typesense: 27.1, running in Docker container
- Scraper: typesense/docsearch-scraper:0.12.0.rc6, running in Docker container:
docker run -it --env-file=../.env -e "CONFIG=$(cat scraper-config.json | jq -r tostring)" typesense/docsearch-scraper:0.12.0.rc6
- Docusaurus 3.6.x (with build and serve, not dev mode), running in shell
- Repository, including the instructions: https://github.com/juhamust/docusaurus-typesense-search
I tested it in your repository. Could you try replicating it in a containerized environment?
Sorry for the late reply: no matter how I try to run it, the result is the same for me 🫤 I'm now running the solution using Docker Compose:
- container 1: typesense server
- container 2: docuserver
- container 3: Typesense scraper
I still get the same issue. Or how are you running it?
I'm running the Docscraper in Docker, and Typesense in Docker as well. Docusaurus is running through Node 23 on Arch Linux, kernel 6.12.4.
It is very similar to mine, except on Mac. I wonder if I'm still somehow running the non-fixed version of Scraper 🤔 But, when looking at the files, it contains the changes introduced in #80 🤷
To make sure that's the case, try cloning this and running it locally. I'm using the pipenv shell with Python 3.10.16 here and it works. Indeed, it does not work with the Docker image every time.