browsertrix-crawler

Problems crawling buildcanada.com

Open mveytsman opened this issue 11 months ago • 2 comments

I archived the page with ArchiveWeb.page, and it replays successfully both within the app and in ReplayWeb.page. When I crawl it with Browsertrix, however, the page does not replay successfully:

Here's a screenshot of what I expect at https://buildcanada.com/memos Image

Here's what I see in the replay

Image

I also experienced some text problems on a page that does load (https://buildcanada.com):

Image

You can see some of these in QA, in the extracted text difference on this crawl.

I ran this by @Shrinks99, who ran a crawl using the beta channel. It delivered improved results, but still did not complete as successfully as ArchiveWeb.page.

mveytsman, Feb 05 '25 23:02

After some further investigation with glogg, it seems that the text issues present in this archive are actually written into the WARCs:

Image

The garbled text does not seem to be present in the response (as of today) as viewed in Firefox:

Image
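For anyone wanting to check whether garbled text is stored in a record's bytes rather than introduced at replay time, here is a minimal Python sketch. It tests for one common kind of garbling only: UTF-8 text that was misread as Latin-1 and then re-encoded as UTF-8. That this is what happened in these crawls is an assumption, not something confirmed in this thread, and the function name `undo_double_encoding` is hypothetical.

```python
from typing import Optional


def undo_double_encoding(payload: bytes) -> Optional[str]:
    """Return repaired text if the payload looks double-encoded, else None.

    Assumption: the garbling is UTF-8 misread as Latin-1 and re-encoded as
    UTF-8. Other causes (wrong charset header, cp1252, etc.) are not covered.
    """
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8 at all, so this is a different problem
    if text.isascii():
        return None  # pure ASCII cannot be garbled this way
    try:
        # Reverse the suspected mistake: back to the original bytes, then
        # decode those bytes as the UTF-8 they were meant to be.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # the round trip fails, so the text is probably genuine
```

If the function returns a string for a payload copied out of the WARC but returns `None` for the same resource fetched live, that would be consistent with the garbling having been written at capture time. Short non-ASCII strings can occasionally give false positives, so treat a match as a hint rather than proof.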

ReplayWeb.page seems to render the text correctly only some of the time. When loaded from the WARC file itself, the page presents no issues. When loaded from the WACZ, it always renders incorrectly at first, though it renders correctly once reloaded, whether viewed in ReplayWeb.page or in the desktop app.[^1]

Image

[^1]: I made sure to purge the cache while testing this behaviour to nail it down. I am unable to reproduce it in the embedded Browsertrix viewer, which is especially strange.
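To repeat the WARC-versus-WACZ comparison described above, the WARC files can be copied out of the WACZ, since a WACZ is a ZIP file that keeps its WARCs under an `archive/` directory. The following is a minimal sketch using only the Python standard library; the function name `extract_warcs` and any file paths passed to it are hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def extract_warcs(wacz_path, out_dir):
    """Copy every WARC found under archive/ in a WACZ into out_dir.

    Returns the list of extracted file paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            is_warc = name.endswith((".warc", ".warc.gz"))
            if name.startswith("archive/") and is_warc:
                target = out / Path(name).name
                # Stream the member to disk so large WARCs are not held
                # in memory all at once.
                with wacz.open(name) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(target)
    return extracted
```

Calling `extract_warcs("crawl.wacz", "warcs")` (with a placeholder file name) would leave the WARCs in a `warcs` directory, where each can be loaded on its own and compared against the replay of the full WACZ.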

Shrinks99, Feb 26 '25 18:02

I am facing similar text/charset/encoding issues when crawling http://rsarchive.org using a local/Kubernetes deployment.

Sample URL: https://rsarchive.org/Lectures/GA051/English/UNK1970/HisMid_index.html

Original: Image

Inside the "Replay" section from Browsertrix: Image

dcominottim, Apr 30 '25 19:04