browsertrix
[Feature]: Allow sitemaps to be specified as a seed URL
Browsertrix Cloud Version
v1.8.0-beta.2-3aebf2e
What did you expect to happen? What happened instead?
If I use https://www.sn.dk/sitemaps/term/Place.Sitemap.0.xml as a seed, the sitemap is not crawled, and I do not get the files it links to.
Step-by-step reproduction instructions
- Try crawling https://www.sn.dk/sitemaps/term/Place.Sitemap.0.xml as a seed.
Additional details
No response
Currently, the crawler does not support this; it only works with HTML pages as seeds. The sitemap has to be passed via the --sitemap parameter instead. We could look into supporting sitemaps as seeds.
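Until sitemaps are accepted as seeds, one possible workaround is to extract the page URLs from the sitemap yourself and add them as individual seeds. The following is only a minimal sketch, assuming the sitemap follows the standard sitemaps.org schema; the function name `extract_sitemap_urls` and the sample document are hypothetical and not part of Browsertrix.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be the one in use).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every non-empty <loc> value found in a sitemap document.

    Hypothetical helper for illustration; it also picks up <loc> entries
    of a sitemap index, which point at further sitemaps rather than pages.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls


# Made-up sample sitemap, used only to show the expected input shape.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc> https://example.com/b </loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

The resulting list could then be pasted into the seed URL list of a crawl. This does not replace proper sitemap support, since it has to be rerun whenever the sitemap changes.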