typesense-docsearch-scraper

Scraper is unreliable: Some pages are not found at times.

Open juhamust opened this issue 1 year ago • 7 comments

Description

The number of pages and records the scraper finds and processes varies greatly.

Steps to reproduce

Reproduce the issue by using the provided repository or see the screenshot https://github.com/juhamust/docusaurus-typesense-search

Screenshot from subsequent runs:

image

Expected Behavior

The number of records should remain consistent.

Actual Behavior

Metadata

Both the server and the scraper run in Docker containers on macOS. The target is a fairly vanilla Docusaurus website, built and served with Docusaurus itself.

Typesense Scraper Version: typesense/docsearch-scraper:0.11.0
Typesense Version: typesense/typesense:27.1

juhamust avatar Jan 02 '25 15:01 juhamust

Thank you for reporting this issue with detailed reproduction steps. I've identified and fixed the root cause of the inconsistent page scraping behavior.

The problem was in how we handled page loading - we were using a static delay which didn't account for varying page load times and dynamic content. This led to inconsistent results where some content might not be fully loaded before scraping began.

I've submitted #80 to address that:

  • Replace static delays with dynamic page load detection
  • Add proper fallback behavior if a page takes too long to load
  • Ensure all dynamic content is properly loaded before scraping

The fix uses Selenium's WebDriverWait to actively check when the page is fully loaded rather than using a fixed timeout. In testing, this has resulted in consistent page counts between runs.
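
To illustrate the difference between a fixed delay and an active readiness check, here is a minimal, library-free sketch of the polling idea. It is not the code from #80: the function name `wait_until_loaded` and its parameters are hypothetical, and the only thing assumed of `driver` is Selenium's `execute_script` method. Selenium's `WebDriverWait(driver, timeout).until(...)` provides the same kind of polling loop and raises `TimeoutException` instead of returning `False`.

```python
import time


def wait_until_loaded(driver, timeout=10.0, poll_interval=0.5):
    """Poll until the browser reports that the page has finished loading.

    `driver` is assumed to expose Selenium's `execute_script` method.
    Returns True once the page is ready, or False if `timeout` seconds
    pass first, so the caller can decide on a fallback (hypothetical API).
    """
    deadline = time.monotonic() + timeout
    while True:
        # document.readyState becomes "complete" once the document and
        # its sub-resources have finished loading.
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Note that `document.readyState` alone does not guarantee that client-side rendered content is present, so a real check may also need to wait for a specific element to appear.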

Could you please:

  1. Pull the latest changes
  2. Run a few test scrapes against your Docusaurus site
  3. Verify that you're now getting consistent results between runs
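
One way to make the comparison in step 3 less error-prone is to note the record count reported after each run and compare the counts programmatically. The helper below is a hypothetical sketch, not part of the scraper; `summarize_runs` and its input format are assumptions.

```python
def summarize_runs(record_counts):
    """Summarize record counts collected from several scraper runs.

    `record_counts` is a list with one integer per run (hypothetical input,
    e.g. copied by hand from the scraper's output).
    Returns a dict with the smallest count, the largest count, and whether
    every run produced the same number of records.
    """
    if not record_counts:
        raise ValueError("at least one run is required")
    lowest = min(record_counts)
    highest = max(record_counts)
    return {
        "min": lowest,
        "max": highest,
        "consistent": lowest == highest,
    }
```

If `consistent` is `False`, the gap between `min` and `max` gives a rough idea of how many records go missing in the worst run.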

On my Arch Linux (Kernel Linux 6.12.10-arch1-1) machine using the chromium arch repository package, this change produced stable results for over 14 consecutive runs.

Let me know if you're still seeing any variance in the number of pages scraped. I'm happy to investigate further if needed.

tharropoulos avatar Feb 03 '25 13:02 tharropoulos

Thanks for the update and the fix! However, for some reason, I'm still seeing the variation in the outcome 😞

Image


The runs were performed with the following setup:

  • MacOS 15.3
  • Typesense: 27.1, running in Docker container
  • Scraper: docker run -it --env-file=../.env -e "CONFIG=$(cat scraper-config.json | jq -r tostring)" typesense/docsearch-scraper:0.12.0.rc6, running in Docker container
  • Docusaurus 3.6.x (with build and serve, not dev mode), running in shell
  • Repository, including the instructions: https://github.com/juhamust/docusaurus-typesense-search

juhamust avatar Feb 04 '25 12:02 juhamust

I tested it in your repository. Could you try replicating it in a containerized environment?

tharropoulos avatar Feb 04 '25 13:02 tharropoulos

Sorry for the late reply: no matter how I try to run it, the result is the same for me 🫤 I'm now running the solution using Docker Compose:

  • container 1: Typesense server
  • container 2: Docusaurus server
  • container 3: Typesense scraper

I still get the same issue. Or how are you running it?

juhamust avatar Mar 02 '25 17:03 juhamust

I'm running the Docscraper in Docker, and Typesense in Docker as well. Docusaurus is running through Node 23 on Arch Linux, kernel 6.12.4.

tharropoulos avatar Mar 02 '25 17:03 tharropoulos

> I'm running the Docscraper in Docker, and Typesense in Docker as well. Docusaurus is running through Node 23 on Arch Linux, kernel 6.12.4.

It is very similar to mine, except on Mac. I wonder if I'm still somehow running the non-fixed version of Scraper 🤔 But, when looking at the files, it contains the changes introduced in #80 🤷

juhamust avatar Mar 02 '25 18:03 juhamust

To make sure that's the case, try cloning this repository and running it locally. I'm using the pipenv shell with Python 3.10.16 here, and it works. Indeed, the Docker image does not work every time.

tharropoulos avatar Mar 03 '25 10:03 tharropoulos