
[Feature]: Allow sitemaps to be specified as a seed URL

Open • thsm-kb opened this issue 2 years ago • 1 comment

Browsertrix Cloud Version

v1.8.0-beta.2-3aebf2e

What did you expect to happen? What happened instead?

If I use https://www.sn.dk/sitemaps/term/Place.Sitemap.0.xml as a seed, it is not crawled. I do not get the files it links to.

Step-by-step reproduction instructions

  1. Try crawling https://www.sn.dk/sitemaps/term/Place.Sitemap.0.xml as a seed.

Additional details

No response

thsm-kb, Nov 07 '23 08:11

Currently, the crawler does not support this; it only works with HTML pages as seeds. The sitemap has to be passed via the --sitemap parameter instead. We could look into supporting sitemaps as seeds.

ikreymer, Jan 10 '24 21:01
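
Until sitemaps are accepted as seeds, one possible workaround is to pull the page URLs out of the sitemap yourself and supply those as ordinary seeds. The following is only a minimal sketch of that idea, not part of Browsertrix: the function name `extract_sitemap_urls` and the `example.com` sample are hypothetical, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document.

    Hypothetical helper for illustration; it does not fetch anything
    and does not follow nested sitemap index files.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        # Skip empty elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sample sitemap, standing in for a downloaded file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc> https://example.com/b </loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

Each printed URL could then be entered as a separate seed. In a sitemap index file the `<loc>` entries point to further sitemaps rather than pages, so this sketch would need another pass to resolve those.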