browsertrix-crawler
Automatically crawl `<form>` URLs when `method` is `get`
I have set up a test page at https://website.test.openzim.org/form-get.html
This is a simplified version of something we have encountered in the wild on two occasions.
First on https://chopin.lib.uchicago.edu/. If you open any title and its scores, you will see a combobox in the top right corner, which is a form containing a select element. This website also has prev/next links, so the combobox is not the only navigation option and all pages still get crawled.
Second on https://medecine-integree.com/ (we have only been approached by a user of this website; we do not have the right to copy it ... yet at least). This website has a combobox, "Tous nos articles", which is the only means of navigation to the pages behind it.
These comboboxes are simply used to "easily" generate a "GET" request with a given query parameter in order to load the proper page.
This is what has been reproduced on https://website.test.openzim.org/form-get.html
Currently Browsertrix crawler does not extract links from this kind of form / combobox, as can be seen by running the following command:
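To make the mechanism concrete, here is a rough Python sketch of how the URLs behind such a form could be derived from the HTML alone. It is not how Browsertrix crawler works today, only an illustration; the sample markup, the `page.html` action, the `id` parameter name and the `GetFormLinkExtractor` class are all made up for the example and do not come from the test page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urlsplit, urlunsplit


class GetFormLinkExtractor(HTMLParser):
    """Collect the URLs a GET form containing a <select> can navigate to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self._action = None       # resolved action URL of the current GET form
        self._select_name = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # In HTML, a missing method attribute means GET.
            method = (attrs.get("method") or "get").lower()
            if method == "get":
                self._action = urljoin(self.base_url, attrs.get("action") or "")
        elif tag == "select" and self._action is not None:
            self._select_name = attrs.get("name")
        elif tag == "option" and self._select_name and attrs.get("value"):
            # Submitting a GET form replaces the query string of the action URL.
            query = urlencode({self._select_name: attrs["value"]})
            parts = urlsplit(self._action)
            self.urls.append(urlunsplit(parts._replace(query=query)))

    def handle_endtag(self, tag):
        if tag == "form":
            self._action = None
            self._select_name = None
        elif tag == "select":
            self._select_name = None


# Hypothetical markup, loosely modelled on the kind of combobox described above.
SAMPLE_HTML = """
<form method="get" action="page.html">
  <select name="id">
    <option value="">Choose...</option>
    <option value="1">One</option>
    <option value="2">Two</option>
  </select>
</form>
"""

extractor = GetFormLinkExtractor("https://example.com/form-get.html")
extractor.feed(SAMPLE_HTML)
print(extractor.urls)
```

The sketch only handles a single select per form and ignores any other inputs, which matches the simple case described here but would not be enough for forms with several fields.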
docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.3 crawl --url "https://website.test.openzim.org/form-get.html" --cwd /output
It would help to be able to automatically crawl these.
Or did I miss a way to customize Browsertrix crawler with some JS code that changes how links are extracted? Something like custom behaviors, but if I'm not mistaken those are only aimed at loading more resources of a given page, not at extracting new URLs to crawl.
For now I plan to build my own "fake" sitemap and pass it to Browsertrix crawler so that it picks up the proper links.
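For reference, the workaround could look roughly like the Python sketch below, which writes a list of URLs out in the standard sitemap XML format. The `build_sitemap` helper and the example URLs are hypothetical, and how the resulting file is served to the crawler is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as UTF-8 bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in query strings.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical URLs standing in for the pages hidden behind the combobox.
urls = [
    "https://example.com/page.html?id=1",
    "https://example.com/page.html?id=2&lang=fr",
]
sitemap = build_sitemap(urls)

with open("sitemap.xml", "wb") as fh:
    fh.write(sitemap)
```

The list of URLs still has to come from somewhere, for example by reading the option values of the combobox by hand or with a small script.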