WACZ files downloaded from Browsertrix and then uploaded to Browsertrix using "Upload WACZ" contain 0 pages
Browsertrix Version
v1.18.0-c0cf6e6
What did you expect to happen? What happened instead?
I expected to be able to upload WACZ files made in Browsertrix to other Browsertrix instances and get the same data and replay as the original crawl. This could be used for building or consolidating collections.
Instead this happened: WACZ files downloaded from Browsertrix installations (both local and app.browsertrix.com) and then uploaded using "Upload WACZ" in Browsertrix (local or app) contain 0 pages, still have the same size as before, and cannot be replayed.
WACZ files made with archiveweb.page and uploaded to Browsertrix work fine and can be replayed (see attached; note the "pages" count).
The same WACZ files uploaded to replayweb.page behave as expected: they have pages and can be replayed.
Reproduction instructions
- Make a crawl using Browsertrix
- Download WACZ-file
- Upload WACZ-file to any Browsertrix installation
- Click Archived Items > Uploads
- Notice the size is the same as the original crawl but pages are 0
- Click one of these uploaded files
- Click replay
- Notice there's "No Results Found"
Screenshots / Video
Environment
No response
Additional details
https://app.browsertrix.com/orgs/kb-btrix/items/upload
https://app.browsertrix.com/orgs/kb-btrix/items/upload/upload-5cf2e22f-7b92-4d89-80d3-39b7f978a733#overview
Hi Anders! Thanks for reporting this, definitely seems like a bug we'd want to address. By any chance, are the WACZ files you're uploading from Browsertrix multi-WACZs? Just a hunch but we might need to modify the upload process to account for them.
I have probably done both. The last uploaded file was a lowly single WACZ-file ~50MB
I have verified that this is an issue with multi-WACZs: our routine for reading the page list on WACZ upload doesn't account for them. Unfortunately, the remotezip library we use to read page lists without downloading the entire WACZ file doesn't support nested zips: https://github.com/gtsystem/python-remotezip/issues/26. So we may need to either submit a PR for that ourselves or consider an alternate approach for multi-WACZ uploads.
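To illustrate the nested-zip problem, here is a rough local sketch, not the Browsertrix upload code. It assumes a multi-WACZ is a zip whose members include complete `.wacz` files, and that each inner WACZ stores its page list at `pages/pages.jsonl` with one JSON record per line. The function name `count_pages` is hypothetical. Unlike a range-request approach such as remotezip, it reads each nested archive fully into memory.

```python
import io
import json
import zipfile


def count_pages(wacz_file) -> int:
    """Count page records in a WACZ, descending into nested .wacz members.

    `wacz_file` may be a path or a seekable binary file object.
    Assumption: page records are JSON lines that carry a "url" key,
    while the header line of pages.jsonl does not.
    """
    total = 0
    with zipfile.ZipFile(wacz_file) as zf:
        for name in zf.namelist():
            if name == "pages/pages.jsonl":
                with zf.open(name) as fh:
                    for raw in fh:
                        line = raw.strip()
                        if not line:
                            continue
                        record = json.loads(line)
                        if "url" in record:
                            total += 1
            elif name.endswith(".wacz"):
                # Nested archive: load it into memory and recurse.
                nested = io.BytesIO(zf.read(name))
                total += count_pages(nested)
    return total
```

A reader that only looks for `pages/pages.jsonl` at the top level would report 0 pages for a file laid out this way, which matches the symptom described above.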
Related to #2648, which may mitigate this in some cases.