Crawl ran for more than 2 days on a small Blogspot site
I crawled https://stephenratcliffe.blogspot.com/ for 2 days and the crawl was still running, so I decided to end it. The blog is not big in terms of bytes: the author has written roughly 70 words a day since 2009, ending on Jul 22, 2023, for a total of about 400,000 words. On some days the posts included an image.
My colleague suggested:
> The problem here is that the button to open the month was an element with an onClick handler that triggered an AJAX call. Since it wasn't a button or an anchor tag, crawlers wouldn't usually click on it, and so the AJAX response would never become part of the archive.
I am using a Mac (i7, 16 GB RAM, 500 GB HDD) running macOS 13.5, with Browsertrix Crawler 0.10.3.
The blog can be handled through wget as well:
```bash
#!/bin/bash
wget -H \
  --recursive \
  --page-requisites \
  --adjust-extension \
  --span-hosts \
  --convert-links \
  --restrict-file-names=windows \
  --domains "blogger.googleusercontent.com,stephenratcliffe.blogspot.com,bp.blogspot.com" \
  --regex-type pcre \
  --reject-regex ".*(showComment|feeds.*comments/default).*" \
  --no-parent \
  "http://stephenratcliffe.blogspot.com/"
```
The switches above help wget focus on the right pages and resources. The downsides of this approach are:
- no parallelism/concurrency
- lack of js-awareness
> The problem here is that the button to open the month was an element with an onClick handler that triggered an AJAX call
Yes, but in essence the blog is a static website. Even if there is a bit of JS, there are still enough HTML link elements for the crawler to make its way through.
One note here: even if the blog had links generated only through JavaScript and required a browser, Blogspot/WordPress still respect proper Web standards and expose a /sitemap.xml endpoint that provides all the URLs. From there you can run wget -p -k on each of them to retrieve every page together with its resources, and this time parallelism can be added via xargs or parallel (see the sketch below).
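For completeness, here is roughly what that sitemap-based approach could look like. This is only a sketch: it assumes curl, grep/sed, and an xargs that supports -P are available, and the actual layout of this blog's sitemap hasn't been verified.

```bash
#!/bin/bash
# Minimal sketch: pull page URLs out of the sitemap, then mirror each page
# (with its requisites) 4 at a time.
# Note: if /sitemap.xml turns out to be a sitemap index (a list of
# sub-sitemaps rather than of pages), run the <loc> extraction once more
# on the URLs it yields before handing them to wget.
curl -s "https://stephenratcliffe.blogspot.com/sitemap.xml" \
  | grep -oE '<loc>[^<]+</loc>' \
  | sed -e 's/<loc>//' -e 's#</loc>##' \
  | xargs -n 1 -P 4 wget -q -p -k
```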
Finally, let's say this site were dynamic and there were no other way than to use a JS-aware crawler like browsertrix-crawler.
Here's a setup I wrote for it:
config.yaml:
```yaml
seeds:
  - url: https://stephenratcliffe.blogspot.com/
    scopeType: "host"
    include: (blogger.googleusercontent.com/img|stephenratcliffe.blogspot.com/(\d+/\d+)).*$
    exclude: (blogger.com|stephenratcliffe.blogspot.com/feed|stephenratcliffe.blogspot.com/search|.*showComment).*$
    sitemap: true

combineWARC: true
```
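Before kicking off a long crawl, it can help to sanity-check the include pattern against a few sample URLs. The snippet below is only an illustration: the URLs are made up, and the pattern is rewritten with [0-9] instead of \d so it works with plain grep -E.

```bash
# Rough check of the include scope against made-up sample URLs.
include='(blogger\.googleusercontent\.com/img|stephenratcliffe\.blogspot\.com/[0-9]+/[0-9]+)'
for url in \
  "https://stephenratcliffe.blogspot.com/2023/07/example-post.html" \
  "https://stephenratcliffe.blogspot.com/search?q=example" \
  "https://blogger.googleusercontent.com/img/a/example"; do
  if printf '%s\n' "$url" | grep -Eq "$include"; then
    echo "in scope:     $url"
  else
    echo "out of scope: $url"
  fi
done
```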
docker-compose.yml:
```yaml
version: '3.1'
services:
  crawler:
    image: webrecorder/browsertrix-crawler:v0.10.4
    environment:
      - DISPLAY=:0.0
    volumes:
      - ./crawls:/crawls/
      - ./config.yaml:/app/crawl-config.yaml
      - /var/run/dbus:/var/run/dbus
      - /run/dbus:/run/dbus
      - /tmp/.X11-unix:/tmp/.X11-unix
      - /home/user/.Xauthority:/root/.Xauthority
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    shm_size: 2gb
    privileged: true
    command: "crawl --workers 4 --headless --config /app/crawl-config.yaml"
```
Make sure to create the crawls directory: `mkdir crawls`
Now you may start the crawl: `docker-compose up`
Notice the `blogger.googleusercontent.com/img` filter above; that's where the big HEIC images on that blog are stored.
I'd like to tell browsertrix that it's allowed to fetch those images too, but I'm not quite sure how to do that (since the images are on a different domain). If anyone else would like to share some ideas on that, I'd be grateful.
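In the meantime, one way to check after a run whether those cross-domain images actually ended up in the archive is a rough string search over the produced WARCs. This is not a proper WARC parse (a tool like warcio would be more precise), and the path assumes the ./crawls volume mount from the compose file above.

```bash
# Count records in the gzipped WARCs that mention the image host.
find crawls -name '*.warc.gz' -print0 \
  | xargs -0 gzip -dc \
  | grep -ac 'blogger.googleusercontent.com'
```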
> for 2 days and the crawl was still running
That does seem excessive, yes. Maybe try the alternatives above; they should be faster.
Versions of software used:
| name | version |
|---|---|
| docker | 24.0.6 |
| docker-compose | 2.19.1 |
| dockerd | 24.0.6 |
| browsertrix-crawler | v0.10.4 |
| wget | 1.21.2 |