browsertrix
[Bug]: Problems when crawling ipres2022.scot
Browsertrix Version
v1.10.0-beta.5-a3911f6
What did you expect to happen? What happened instead?
During the IIPC session, I tried to archive https://ipres2022.scot/ via a seeded crawl:
- The homepage crawl failed with a 'Page Worker Timeout'.
- The sitemap did not appear to be accessed.
- Adding the sitemap URL as an additional URL didn't seem to work.
Reproduction instructions
- Make a new seeded crawl
- Use https://ipres2022.scot/ as the seed
- Watch the crawl hang until timeout
Screenshots / Video
No response
Environment
IIPC Browsertrix Instance
Additional details
No response
See also https://app.browsertrix.com/orgs/wac2024-workshop/items/crawl/manual-20240425222929-55b57b82-a92/review/screenshots?qaRunId=qa-20240425230548-55b57b82-a92&itemPageId=0a54db6a-0ff6-45a3-a1cd-3f583a4e5ae7 and note that the 'dual slider view' doesn't cope correctly in this situation.
I believe this is an issue with our sitemap parsing implementation.
Ah, interesting, thank you. When I switch off the sitemap option, the homepage crawls and renders. Pretty good too - only missing the embedded video. EDIT: Would videos embedded like that be expected to work?
Would videos embedded like that be expected to work?
Typically, yes! Can look into that as well as the sitemap parsing (I think it might be a few more jumps between sitemaps than our current implementation expects, but will have to look more closely)
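To illustrate the "jumps between sitemaps" idea: a sitemap index can point at further sitemap indexes before reaching a sitemap that actually lists pages. The sketch below is not Browsertrix's implementation; it is a minimal Python illustration in which `iter_sitemap_urls`, the `fetch` callback, and the `max_depth` parameter are all hypothetical names. It follows nested indexes up to a configurable number of hops and skips any sitemap it has already seen.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_urls(
    url: str,
    fetch: Callable[[str], bytes],  # hypothetical: returns the XML body for a URL
    max_depth: int = 5,             # hypothetical: max index-to-sitemap hops to follow
    _seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following nested sitemap indexes."""
    seen = _seen if _seen is not None else set()
    if url in seen:
        return  # avoid loops between sitemaps that reference each other
    seen.add(url)

    root = ET.fromstring(fetch(url))

    if root.tag == NS + "sitemapindex":
        if max_depth <= 0:
            return  # hop budget exhausted; stop descending
        for loc in root.findall(f"{NS}sitemap/{NS}loc"):
            child = (loc.text or "").strip()
            if child:
                yield from iter_sitemap_urls(child, fetch, max_depth - 1, seen)
    elif root.tag == NS + "urlset":
        for loc in root.findall(f"{NS}url/{NS}loc"):
            page = (loc.text or "").strip()
            if page:
                yield page
```

Passing `fetch` in as a callback keeps the traversal independent of any particular HTTP client. A parser with too small a hop budget for a given site would silently return no pages, which would look like the sitemap never having been accessed.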
Sitemap parsing and image compare issues fixed. Video crawling to be investigated separately.