[Bug]: Problems when crawling ipres2022.scot

Open anjackson opened this issue 9 months ago • 4 comments

Browsertrix Version

v1.10.0-beta.5-a3911f6

What did you expect to happen? What happened instead?

During the IIPC session, I tried to archive https://ipres2022.scot/ via a seeded crawl:

The homepage crawl failed with a 'Page Worker Timeout'.
The sitemap did not appear to be accessed.
Adding the sitemap URL as an additional URL didn't seem to work either.

Reproduction instructions

1. Make a new seeded crawl
2. Use https://ipres2022.scot/ as the seed
3. Watch the crawl hang until timeout

Screenshots / Video

No response

Environment

IIPC Browsertrix Instance

Additional details

No response

Apr 25 '24 13:04 anjackson

See also https://app.browsertrix.com/orgs/wac2024-workshop/items/crawl/manual-20240425222929-55b57b82-a92/review/screenshots?qaRunId=qa-20240425230548-55b57b82-a92&itemPageId=0a54db6a-0ff6-45a3-a1cd-3f583a4e5ae7 and note that the 'dual slider view' doesn't cope correctly in this situation.

Apr 29 '24 19:04 anjackson

I believe this is an issue with our sitemap parsing implementation.

May 15 '24 20:05 tw4l

Ah, interesting, thank you. When I switch off the sitemap option, the homepage crawls and renders. Pretty good too - only missing the embedded video. EDIT: Would videos embedded like that be expected to work?

May 16 '24 14:05 anjackson

> Would videos embedded like that be expected to work?

Typically, yes! I can look into that as well as the sitemap parsing. I think the site might involve a few more jumps between sitemaps than our current implementation expects, but I will have to look more closely.

May 16 '24 16:05 tw4l

Sitemap parsing and image compare issues fixed. Video crawling to be investigated separately.

May 29 '24 21:05 ikreymer
