[Bug]: Problems when crawling ipres2022.scot

Open anjackson opened this issue 9 months ago • 4 comments

Browsertrix Version

v1.10.0-beta.5-a3911f6

What did you expect to happen? What happened instead?

During the IIPC session, I tried to archive https://ipres2022.scot/ via a seeded crawl:

  • That homepage crawl failed with a 'Page Worker Timeout'.
  • The sitemap did not appear to be accessed.
  • Adding the sitemap URL as an additional URL didn't seem to work either.

Reproduction instructions

  1. Make a new seeded crawl
  2. Use https://ipres2022.scot/ as the seed
  3. Watch the crawl hang until timeout

Screenshots / Video

No response

Environment

IIPC Browsertrix Instance

Additional details

No response

anjackson, Apr 25 '24 13:04

See also https://app.browsertrix.com/orgs/wac2024-workshop/items/crawl/manual-20240425222929-55b57b82-a92/review/screenshots?qaRunId=qa-20240425230548-55b57b82-a92&itemPageId=0a54db6a-0ff6-45a3-a1cd-3f583a4e5ae7 and note that the 'dual slider view' doesn't cope correctly in this situation.

anjackson, Apr 29 '24 19:04

I believe this is an issue with our sitemap parsing implementation.

tw4l, May 15 '24 20:05

Ah, interesting, thank you. When I switch off the sitemap option, the homepage crawls and renders. Pretty good too - only missing the embedded video. EDIT: Would videos embedded like that be expected to work?

anjackson, May 16 '24 14:05

> Would videos embedded like that be expected to work?

Typically, yes! Can look into that as well as the sitemap parsing (I think the site might need a few more jumps between sitemaps than our current implementation expects, but will have to look more closely).
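
For illustration only, here is a rough Python sketch of what "jumps between sitemaps" could mean: a sitemap index can point at further sitemap indexes, and a parser that only follows a fixed number of hops never reaches the page URLs. This is not Browsertrix's actual implementation. `collect_page_urls`, `max_hops`, the injected `fetch` callable, and the `example.org` documents are all hypothetical, and the real layout of the ipres2022.scot sitemaps is not assumed here.

```python
from collections import deque
from typing import Callable, List
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(
    start_url: str,
    fetch: Callable[[str], str],  # hypothetical: returns the XML text for a URL
    max_hops: int = 2,            # hypothetical limit on nested sitemap hops
) -> List[str]:
    """Follow sitemap indexes breadth-first, at most max_hops away from start_url."""
    pages: List[str] = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, hops = queue.popleft()
        root = ET.fromstring(fetch(url))
        if root.tag == NS + "sitemapindex":
            # An index lists other sitemaps; each one costs one more hop.
            for loc in root.findall(f"{NS}sitemap/{NS}loc"):
                child = (loc.text or "").strip()
                if child and child not in seen and hops + 1 <= max_hops:
                    seen.add(child)
                    queue.append((child, hops + 1))
        elif root.tag == NS + "urlset":
            # A urlset lists the actual page URLs.
            for loc in root.findall(f"{NS}url/{NS}loc"):
                page = (loc.text or "").strip()
                if page:
                    pages.append(page)
    return pages


# Made-up documents: an index that points at a second index,
# which finally points at a urlset with two pages.
_XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
SAMPLE_SITEMAPS = {
    "https://example.org/sitemap.xml": (
        f"<sitemapindex {_XMLNS}><sitemap>"
        "<loc>https://example.org/sitemap-a.xml</loc>"
        "</sitemap></sitemapindex>"
    ),
    "https://example.org/sitemap-a.xml": (
        f"<sitemapindex {_XMLNS}><sitemap>"
        "<loc>https://example.org/sitemap-pages.xml</loc>"
        "</sitemap></sitemapindex>"
    ),
    "https://example.org/sitemap-pages.xml": (
        f"<urlset {_XMLNS}>"
        "<url><loc>https://example.org/</loc></url>"
        "<url><loc>https://example.org/programme/</loc></url>"
        "</urlset>"
    ),
}
```

With these made-up documents, `max_hops=1` stops at the second index and finds no pages, while `max_hops=2` reaches the urlset. That is the kind of mismatch that would leave a crawl without any sitemap-derived URLs.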

tw4l, May 16 '24 16:05

Sitemap parsing and image compare issues fixed. Video crawling to be investigated separately.

ikreymer, May 29 '24 21:05