Problems crawling buildcanada.com
I archived the page with ArchiveWeb.page and it replays successfully both within the app and in ReplayWeb.page. When I crawl it with Browsertrix, the page does not replay successfully:
Here's a screenshot of what I expect at https://buildcanada.com/memos
Here's what I see in the replay
I also experienced some text problems on a page that does load (https://buildcanada.com):
You can see some of these in QA with the extracted text difference on this crawl
I ran this by @Shrinks99, who ran a crawl using the beta channel. It delivered improved results but was still not as successful as ArchiveWeb.page.
After some further investigation with glogg, it seems that the text issues present in this archive are actually written into the WARCs.
This does not seem to be present in the response (as of today) as viewed in Firefox:
ReplayWeb.page seems to render them correctly only some of the time. When loaded from the WARC file itself, the page presents no issues. When loaded from the WACZ, it always renders incorrectly at first, but renders correctly once reloaded, both in ReplayWeb.page and in the desktop app.[^1]
[^1]: I purged the cache between tests to nail this behaviour down. I am unable to reproduce it in the embedded Browsertrix viewer, which is especially strange.
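
For anyone else trying to tell "garbled in the WARC" apart from "garbled at replay", here is a minimal sketch of one check that can be run on the raw payload bytes of a response record. It assumes you have already extracted the body bytes with a tool of your choice, and it only tests one hypothesis that this thread has not confirmed: that UTF-8 text was decoded as Latin-1 and re-encoded as UTF-8 (double encoding). The function name `classify_payload` is made up for illustration.

```python
def classify_payload(payload: bytes) -> str:
    """Roughly classify how a response body's bytes relate to UTF-8.

    Returns "not-utf8", "utf8", or "double-encoded". This is a heuristic:
    it assumes Latin-1 as the wrong intermediate charset, whereas real
    mojibake often involves Windows-1252 instead.
    """
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are in some other charset (or are binary data).
        return "not-utf8"

    try:
        # If every character fits in Latin-1 and the resulting bytes are
        # themselves valid UTF-8, the text was probably encoded twice.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"

    # Pure ASCII survives the round trip unchanged and is not mojibake.
    return "double-encoded" if repaired != text else "utf8"


if __name__ == "__main__":
    good = "café".encode("utf-8")
    bad = good.decode("latin-1").encode("utf-8")  # simulate double encoding
    print(classify_payload(good))
    print(classify_payload(bad))
```

If the stored payload classifies as `"utf8"` but the replay still looks wrong, that would point toward the charset being misdetected at replay time rather than toward damaged bytes in the WARC.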
I am facing similar text/charset/encoding issues when crawling http://rsarchive.org using a local/Kubernetes deployment.
Sample URL: https://rsarchive.org/Lectures/GA051/English/UNK1970/HisMid_index.html
Original:
Inside the "Replay" section from Browsertrix: