Crawl ran for more than 2 days on a small Blogspot site
I crawled https://stephenratcliffe.blogspot.com/ for 2 days and the crawl was still running, so I decided to end it. The blog is not big in terms of bytes: the author has written roughly 70 words a day since 2009, ending on Jul 22, 2023, for a total of about 400,000 words. On some days the posts included an image.
My colleague suggested:
> The problem here is that the button to open the month was an element with an onClick handler that triggered an AJAX call. Since it wasn't a button or an anchor tag, crawlers wouldn't usually click on it, and so the AJAX response would never become part of the archive.
I am using a Mac (i7, 16 GB RAM, 500 GB HDD) running macOS 13.5, with Browsertrix Crawler 0.10.3.
The blog can be handled through wget as well:
```bash
#!/bin/bash
wget -H \
  --recursive \
  --page-requisites \
  --adjust-extension \
  --span-hosts \
  --convert-links \
  --restrict-file-names=windows \
  --domains "blogger.googleusercontent.com,stephenratcliffe.blogspot.com,bp.blogspot.com" \
  --regex-type pcre \
  --reject-regex ".*(showComment|feeds.*comments/default).*" \
  --no-parent \
  "http://stephenratcliffe.blogspot.com/"
```
The switches above help wget focus on the right pages and resources. The downsides of this approach are:
- no parallelism/concurrency
- lack of js-awareness
> The problem here is that the button to open the month was an element with an onClick handler that triggered an AJAX call
Yes, but in essence the blog is a static website. Even if there is a bit of JS, there are still enough HTML link elements for the crawler to make its way through.
One note here: even if the blog had links generated only through JavaScript and required a browser, Blogspot/WordPress still respect proper Web standards and expose a /sitemap.xml endpoint that provides all the URLs. From there you can run wget -p -k on each of them to retrieve every page together with its resources, and this time parallelism can be added via xargs or parallel (see the sketch below).
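For completeness, here is roughly what that sitemap-based approach could look like. This is only a sketch: it assumes curl, grep/sed, and an xargs that supports -P are available, and the actual layout of this blog's sitemap hasn't been verified.

```bash
#!/bin/bash
# Minimal sketch: pull page URLs out of the sitemap, then mirror each page
# (with its requisites) 4 at a time.
# Note: if /sitemap.xml turns out to be a sitemap index (a list of
# sub-sitemaps rather than of pages), run the <loc> extraction once more
# on the URLs it yields before handing them to wget.
curl -s "https://stephenratcliffe.blogspot.com/sitemap.xml" \
  | grep -oE '<loc>[^<]+</loc>' \
  | sed -e 's/<loc>//' -e 's#</loc>##' \
  | xargs -n 1 -P 4 wget -q -p -k
```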
Finally, let's say this site were dynamic and there were no other way than to use a JS-aware crawler like browsertrix-crawler.
Here's a setup I wrote for it:
config.yaml:
```yaml
seeds:
  - url: https://stephenratcliffe.blogspot.com/
    scopeType: "host"
    include: (blogger.googleusercontent.com/img|stephenratcliffe.blogspot.com/(\d+/\d+)).*$
    exclude: (blogger.com|stephenratcliffe.blogspot.com/feed|stephenratcliffe.blogspot.com/search|.*showComment).*$
    sitemap: true

combineWARC: true
```
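Before kicking off a long crawl, it can help to sanity-check the include pattern against a few sample URLs. The snippet below is only an illustration: the URLs are made up, and the pattern is rewritten with [0-9] instead of \d so it works with plain grep -E.

```bash
# Rough check of the include scope against made-up sample URLs.
include='(blogger\.googleusercontent\.com/img|stephenratcliffe\.blogspot\.com/[0-9]+/[0-9]+)'
for url in \
  "https://stephenratcliffe.blogspot.com/2023/07/example-post.html" \
  "https://stephenratcliffe.blogspot.com/search?q=example" \
  "https://blogger.googleusercontent.com/img/a/example"; do
  if printf '%s\n' "$url" | grep -Eq "$include"; then
    echo "in scope:     $url"
  else
    echo "out of scope: $url"
  fi
done
```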
docker-compose.yml:
```yaml
version: '3.1'
services:
  crawler:
    image: webrecorder/browsertrix-crawler:v0.10.4
    environment:
      - DISPLAY=:0.0
    volumes:
      - ./crawls:/crawls/
      - ./config.yaml:/app/crawl-config.yaml
      - /var/run/dbus:/var/run/dbus
      - /run/dbus:/run/dbus
      - /tmp/.X11-unix:/tmp/.X11-unix
      - /home/user/.Xauthority:/root/.Xauthority
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    shm_size: 2gb
    privileged: true
    command: "crawl --workers 4 --headless --config /app/crawl-config.yaml"
```
Make sure to create the crawls directory: `mkdir crawls`
Now you may start the crawl: `docker-compose up`
Notice the `blogger.googleusercontent.com/img` filter above; that's where the big HEIC images on that blog are stored.
I'd like to tell browsertrix that it's allowed to fetch those images too, but I'm not quite sure how to do that (since the images are on a different domain). If anyone else would like to share some ideas on that, I'd be grateful.
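In the meantime, one way to check after a run whether those cross-domain images actually ended up in the archive is a rough string search over the produced WARCs. This is not a proper WARC parse (a tool like warcio would be more precise), and the path assumes the ./crawls volume mount from the compose file above.

```bash
# Count records in the gzipped WARCs that mention the image host.
find crawls -name '*.warc.gz' -print0 \
  | xargs -0 gzip -dc \
  | grep -ac 'blogger.googleusercontent.com'
```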
> for 2 days and the crawl was still running
That does seem excessive, yes. Maybe try the alternatives above; they should be faster.
Versions of software used:
| name | version |
|---|---|
| docker | 24.0.6 |
| docker-compose | 2.19.1 |
| dockerd | 24.0.6 |
| browsertrix-crawler | v0.10.4 |
| wget | 1.21.2 |