browsertrix-crawler

Automatically crawl `<form>` URLs when `method` is `get`

Open benoit74 opened this issue 4 months ago • 0 comments

I have set up a test page at https://website.test.openzim.org/form-get.html

This is a simplified version of something we have encountered in the wild on two occasions.

First on https://chopin.lib.uchicago.edu/. If you open any title and its scores, you will see a combobox in the top right corner, which is a form containing a select element. This website also has prev/next links, so the combobox is not the only navigation option and all pages still get crawled.

Second on https://medecine-integree.com/ (we have only been approached by a user of this website; we do not have the right to copy it ... yet at least). On this website there is a combobox, "Tous nos articles", which is the only means of navigation to the pages behind it.

These comboboxes are simply used to "easily" generate a GET request with a given query parameter in order to load the proper page.

This is what has been reproduced on https://website.test.openzim.org/form-get.html
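To make the mechanism concrete, here is a minimal Python sketch of how the target URLs could be derived from such a form: one URL per `<option>` of a `<select>` inside a form whose method is GET. This is not how Browsertrix Crawler works today, only an illustration. The sample markup, the `/page` action, the `id` parameter name, and the `example.org` host are all assumptions and do not come from the real test page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit


class GetFormLinkExtractor(HTMLParser):
    """Collect the URLs a GET form would request, one per <option> value."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self._action = None       # resolved action URL while inside a GET form
        self._select_name = None  # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A form without a method attribute submits with GET.
            method = (attrs.get("method") or "get").lower()
            if method == "get":
                # An empty or missing action submits to the page itself.
                self._action = urljoin(self.base_url, attrs.get("action") or "")
        elif tag == "select" and self._action is not None:
            self._select_name = attrs.get("name")
        elif tag == "option" and self._action is not None and self._select_name:
            # Simplification: options without an explicit, non-empty value
            # attribute (e.g. a "Choose..." placeholder) are skipped.
            value = attrs.get("value")
            if value:
                query = urlencode({self._select_name: value})
                # A GET submission replaces the query string of the action URL.
                parts = urlsplit(self._action)
                self.urls.append(urlunsplit(parts._replace(query=query)))

    def handle_endtag(self, tag):
        if tag == "form":
            self._action = None
            self._select_name = None
        elif tag == "select":
            self._select_name = None


def extract_get_form_urls(html, base_url):
    parser = GetFormLinkExtractor(base_url)
    parser.feed(html)
    return parser.urls


# Hypothetical markup, loosely modelled on the kind of combobox described above.
SAMPLE_HTML = """
<form action="/page" method="get">
  <select name="id">
    <option value="">Choose an article</option>
    <option value="1">First</option>
    <option value="2">Second</option>
  </select>
</form>
"""

urls = extract_get_form_urls(SAMPLE_HTML, "https://example.org/form-get.html")
for url in urls:
    print(url)
```

The sketch only handles a single `<select>` per form; forms combining several fields would need the cross product of their values, which can grow quickly and is probably why this would have to be opt-in.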

Currently, Browsertrix Crawler does not extract links from this kind of form / combobox, as can be seen by running the following command:

docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.3 crawl --url "https://website.test.openzim.org/form-get.html" --cwd /output

It would help to be able to automatically crawl these.

Or is there a possibility I missed to customize Browsertrix Crawler with some JS code that customizes link extraction? A bit like custom behaviors, but, if I'm not mistaken, those are only aimed at loading more resources on a given page, not at extracting new URLs to crawl.

For now, I plan to build a "fake" sitemap on my own and pass it to Browsertrix Crawler to supply the proper links.
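As a rough idea of that workaround, the following Python sketch builds a minimal sitemap document from a list of URLs. The URLs shown are hypothetical placeholders; in practice they would be the query-parameter URLs the combobox generates. How the resulting file is then handed to the crawler is left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical URLs standing in for the pages behind the combobox.
sitemap = build_sitemap([
    "https://example.org/page?id=1",
    "https://example.org/page?id=2&lang=fr",
])
print(sitemap)
```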

benoit74 avatar Oct 11 '24 19:10 benoit74