
WACZ-files downloaded from Browsertrix and then uploaded to Browsertrix using "Upload WACZ" contain 0 pages

Open Klindten opened this issue 6 months ago • 4 comments

Browsertrix Version

v1.18.0-c0cf6e6

What did you expect to happen? What happened instead?

I expected to be able to add/upload WACZ-files made in Browsertrix to be reused/uploaded in other Browsertrix-instances and have the same data/replay as the original crawl. This could be for building or consolidating collections.

Instead, this happened: WACZ-files downloaded from Browsertrix installations (local and app.browsertrix.com) and then uploaded using "Upload WACZ" in Browsertrix (local or app) contain 0 pages, still have the same size as before, and cannot be replayed.

WACZ-files made with archiveweb.page and uploaded to Browsertrix work fine and can be replayed (see attached; notice "pages").

WACZ-files uploaded to replayweb.page behave as expected: they have pages and can be replayed.

Reproduction instructions

  1. Make a crawl using Browsertrix
  2. Download WACZ-file
  3. Upload WACZ-file to any Browsertrix installation
  4. Click Archived Items > Uploads
  5. Notice the size is the same as the original crawl but pages are 0
  6. Click one of these uploaded files
  7. Click replay
  8. Notice there's "No Results Found"

Screenshots / Video

Image Image

Environment

No response

Additional details

https://app.browsertrix.com/orgs/kb-btrix/items/upload https://app.browsertrix.com/orgs/kb-btrix/items/upload/upload-5cf2e22f-7b92-4d89-80d3-39b7f978a733#overview

Klindten avatar Aug 22 '25 18:08 Klindten

Hi Anders! Thanks for reporting this, definitely seems like a bug we'd want to address. By any chance, are the WACZ files you're uploading from Browsertrix multi-WACZs? Just a hunch but we might need to modify the upload process to account for them.

tw4l avatar Aug 25 '25 13:08 tw4l

I have probably done both. The last uploaded file was a lowly single WACZ-file ~50MB

Klindten avatar Aug 25 '25 13:08 Klindten

I have verified that this is an issue with multi-WACZs: our routine to read the page list on WACZ upload doesn't account for multi-WACZ files. Unfortunately, the remotezip library we're using to read the page lists without needing to download the entire WACZ file doesn't support nested zips: https://github.com/gtsystem/python-remotezip/issues/26. So we may need to either submit a PR for that ourselves or consider an alternate approach for multi-WACZ uploads.
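For anyone wanting to check a downloaded file locally, here is a minimal sketch of the idea, not the Browsertrix upload code. It uses only the standard-library `zipfile` module on a local file (no remotezip, no ranged reads), and it assumes that a multi-WACZ is a zip whose members include `.wacz` files, each carrying its own `pages/pages.jsonl`. The function name `count_pages` is hypothetical.

```python
import io
import json
import zipfile

PAGES_PATH = "pages/pages.jsonl"  # page list location inside a single WACZ


def count_pages(wacz):
    """Count page records in a WACZ given as a path or binary file object.

    If the archive has no page list of its own, look for nested .wacz
    members (the assumed multi-WACZ layout) and sum their page counts.
    """
    total = 0
    with zipfile.ZipFile(wacz) as zf:
        names = zf.namelist()
        if PAGES_PATH in names:
            with zf.open(PAGES_PATH) as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    # Skip the header line, which has no "url" field.
                    if "url" in record:
                        total += 1
            return total
        for name in names:
            if name.endswith(".wacz"):
                # Read the nested archive into memory so it can be opened
                # as a zip of its own. Fine locally, costly for remote files.
                nested = io.BytesIO(zf.read(name))
                total += count_pages(nested)
    return total
```

Running `count_pages("my-crawl.wacz")` on a downloaded file (the filename is a placeholder) that returns a non-zero count only via the nested branch would suggest the file is a multi-WACZ, which matches the case described above. Reading each nested archive fully is exactly what the remote approach tries to avoid, so this is only useful as a local diagnostic.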

tw4l avatar Sep 15 '25 21:09 tw4l

Related to #2648, which may mitigate this in some cases.

ikreymer avatar Sep 17 '25 22:09 ikreymer