pywb
Replay of images fails in pywb (works with archiveweb.page): warcio.exceptions.ArchiveLoadFailed: Original for revisit record could not be loaded
Expected behavior
Pictures contained in the attached warc.gz should be served and displayed. republik1_0.warc.gz republik1_1.warc.gz
Both warc.gz files contain exactly the same crawl, done with archiveweb.page. They differ only in the format version (WARC/1.0 vs. WARC/1.1). Neither of them works in pywb, but both work in archiveweb.page.
What actually happened
- Index one of the attached warc.gz with wb-manager
- Open this page in your pywb: https://www.republik.ch/2022/02/08/mitdebattieren-leicht-gemacht
- Scroll down a little to where the images (GIFs) should appear.
- They are missing, although they are contained in the crawl.
Log
127.0.0.1 - - [2023-01-25 09:57:35] "POST /test/resource/postreq?url=https%3A%2F%2Fcdn.repub.ch%2Fs3%2Frepublik-assets%2Frepos%2Frepublik%2Farticle-so-geht-das-hier%2Fimages%2F0a0eb9c4a1e48ed74b871aca9cd46d788a8b995d.gif%3Fsize%3D922x576%26format%3Dauto%26resize%3D922x&closest=20230125084600&matchType=exact HTTP/1.1" 503 442 0.034582
2023-01-25 09:57:35,446: [DEBUG]: http://localhost:54389 "POST /test/resource/postreq?url=https%3A%2F%2Fcdn.repub.ch%2Fs3%2Frepublik-assets%2Frepos%2Frepublik%2Farticle-so-geht-das-hier%2Fimages%2F0a0eb9c4a1e48ed74b871aca9cd46d788a8b995d.gif%3Fsize%3D922x576%26format%3Dauto%26resize%3D922x&closest=20230125084600&matchType=exact HTTP/1.1" 503 143
127.0.0.1 - - [2023-01-25 09:57:35] "GET /test/20230125084600im_/https://cdn.repub.ch/s3/republik-assets/repos/republik/article-so-geht-das-hier/images/0a0eb9c4a1e48ed74b871aca9cd46d788a8b995d.gif?size=922x576&format=auto&resize=922x HTTP/1.1" 503 1842 0.086629
2023-01-25 09:57:37,862: [DEBUG]: Starting new HTTP connection (1): localhost:54389
Dir .\collections\test\indexes\ unchanged
Dir .\collections\test\indexes\ unchanged
Traceback (most recent call last):
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.3-py3.6.egg\pywb\warcserver\handlers.py", line 160, in __call__
    out_headers, resp = loader(cdx, params)
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.3-py3.6.egg\pywb\warcserver\resource\responseloader.py", line 37, in __call__
    entry = self.load_resource(cdx, params)
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.3-py3.6.egg\pywb\warcserver\resource\responseloader.py", line 206, in load_resource
    local_index_query))
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.3-py3.6.egg\pywb\warcserver\resource\resolvingloader.py", line 89, in load_headers_and_payload
    cdx_loader)
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.3-py3.6.egg\pywb\warcserver\resource\resolvingloader.py", line 217, in _load_different_url_payload
    raise ArchiveLoadFailed(self.MISSING_REVISIT_MSG)
warcio.exceptions.ArchiveLoadFailed: Original for revisit record could not be loaded
NoneType: None
127.0.0.1 - - [2023-01-25 09:57:37] "POST /test/resource/postreq?url=https%3A%2F%2Fcdn.repub.ch%2Fs3%2Frepublik-assets%2Frepos%2Frepublik%2Farticle-so-geht-das-hier%2Fimages%2Fdca51f43ef2d23696d12898735a8b68b65eae0ac.gif%3Fsize%3D992x544%26format%3Dauto%26resize%3D992x&closest=20230125084600&matchType=exact HTTP/1.1" 503 442 0.048205
2023-01-25 09:57:37,915: [DEBUG]: http://localhost:54389 "POST /test/resource/postreq?url=https%3A%2F%2Fcdn.repub.ch%2Fs3%2Frepublik-assets%2Frepos%2Frepublik%2Farticle-so-geht-das-hier%2Fimages%2Fdca51f43ef2d23696d12898735a8b68b65eae0ac.gif%3Fsize%3D992x544%26format%3Dauto%26resize%3D992x&closest=20230125084600&matchType=exact HTTP/1.1" 503 143
127.0.0.1 - - [2023-01-25 09:57:37] "GET /test/20230125084600im_/https://cdn.repub.ch/s3/republik-assets/repos/republik/article-so-geht-das-hier/images/dca51f43ef2d23696d12898735a8b68b65eae0ac.gif?size=992x544&format=auto&resize=992x HTTP/1.1" 503 1842 0.059209
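The `postreq` lines in the log carry the failing URL percent-encoded in the `url` query parameter, which makes them hard to read. A minimal sketch for decoding one of them with the Python standard library follows. The helper name `decode_postreq` is made up for illustration and is not part of pywb; the request path is copied from the log.

```python
from urllib.parse import parse_qs, urlsplit


def decode_postreq(request_target):
    """Return the decoded query parameters of a postreq request path.

    Hypothetical helper for reading the log, not a pywb function.
    """
    query = urlsplit(request_target).query
    # parse_qs percent-decodes the values and returns a list per key.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Request path copied from the first POST line of the log.
target = (
    "/test/resource/postreq?url=https%3A%2F%2Fcdn.repub.ch%2Fs3%2Frepublik-assets"
    "%2Frepos%2Frepublik%2Farticle-so-geht-das-hier%2Fimages%2F"
    "0a0eb9c4a1e48ed74b871aca9cd46d788a8b995d.gif%3Fsize%3D922x576%26format%3Dauto"
    "%26resize%3D922x&closest=20230125084600&matchType=exact"
)

info = decode_postreq(target)
print(info["url"])        # the captured URL pywb tried to load
print(info["closest"])    # the requested timestamp
print(info["matchType"])  # the index match type
```

The decoded `url` value can then be compared directly with the WARC-Target-URI headers in the WARC file.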
Why can't pywb serve the response body with the same payload digest?
@tw4l: Were you able to check it on your side in the meantime? Can you confirm that it is a pywb issue? (And can you fix it, please? Thanks a lot.)
Hi @steph-nb, sorry, I've been kept busy on other tasks so I haven't had a chance to investigate yet, but I will check it out soon! At first glance it seems like a replay issue. replayweb.page and pywb use different replay systems, so it's a fair bit of work to keep them at parity. We have a PR that handles JS modules better in pywb that might help. I'll update here when I've had a chance to look.
Hi @tw4l , yes of course, I understand. I faced the same issue again, and therefore just wanted to make sure that it is not forgotten ;-) Many thanks and BR
Hello,
I have the same issue. A resource is not available in pywb replay (error message: "Original for revisit record could not be loaded"), but archiveweb.page shows the resource.
It seems that pywb fails to connect a revisit record with its related response record when both have the same payload digest.
The WARC was created with ArchiveWeb.Page.
Revisit Record (not accessible in pywb)
WARC/1.1
WARC-Record-ID: urn:uuid:73dc1b7b-c7d5-5e09-ba46-35bc6e004fa8
WARC-Page-ID: 1cxudtjkxau5bsvxmlkrpu
WARC-Payload-Digest: sha-256:2059c52230d630aff5090d2964125eee31961c7b17bfe19cc9efe00861be82d1
WARC-Target-URI: https://t4.bcbits.com/stream/40e3bc944b3f75aaefe21592ac44129b/mp3-128/3055200501?p=0&ts=1702117684&t=d5d33d77c1776323faf72e4de8a464c5312130eb&token=1702117684_76afa40256625ac803069964cc5db9674826c5db
WARC-Date: 2023-12-08T10:28:30.630Z
WARC-Type: revisit
WARC-Profile: http://netpreserve.org/warc/1.1/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: https://t4.bcbits.com/stream/40e3bc944b3f75aaefe21592ac44129b/mp3-128/3055200501?p=0&ts=1702117656&t=e98d3aade08fc32e88c47764bc30293ecf6da025&token=1702117656_64a38ea3da9e4d05d198672b6e521d9ff830d3a5
WARC-Refers-To-Date: 2023-12-08T10:27:59.598Z
Content-Type: application/http; msgtype=response
Content-Length: 412
[...]
The referenced target URI is present in the WARC file as a response record with a payload:
Response Record
WARC/1.1
WARC-Record-ID: urn:uuid:5f44e89a-e28a-5f02-a6d9-05cc50a64440
WARC-Page-ID: 0v6oj9oy1c4amy9ugxo8x8l
WARC-Target-URI: https://t4.bcbits.com/stream/40e3bc944b3f75aaefe21592ac44129b/mp3-128/3055200501?p=0&ts=1702117656&t=e98d3aade08fc32e88c47764bc30293ecf6da025&token=1702117656_64a38ea3da9e4d05d198672b6e521d9ff830d3a5
WARC-Date: 2023-12-08T10:27:59.598Z
WARC-Type: response
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:2059c52230d630aff5090d2964125eee31961c7b17bfe19cc9efe00861be82d1
WARC-Block-Digest: sha256:393b903c7df0cb34a57cabd7a265172e2d4fc824bacaab856f15816691c627bb
Content-Length: 2894366
[...]
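The link between the two records quoted above can be checked by hand. The sketch below looks up the response whose WARC-Target-URI and WARC-Date equal the revisit's WARC-Refers-To-Target-URI and WARC-Refers-To-Date, then compares the payload digests. It works on plain dictionaries filled with the header values shown above. The function names are invented for illustration, and this is not how pywb resolves revisits internally. One visible detail in the quoted headers is that the digest label is spelled `sha-256:` in the revisit and `sha256:` in the response; whether that plays any role in the failure is not established here, so the sketch simply reports both an exact and a label-insensitive comparison.

```python
def digest_parts(value):
    """Split 'label:hex' into (label, hex), ignoring case and hyphens in the label."""
    label, _, hexdigest = value.partition(":")
    return label.lower().replace("-", ""), hexdigest.lower()


def find_original(revisit, records):
    """Return the response record a revisit refers to by URI and date, or None."""
    for rec in records:
        if (rec.get("WARC-Type") == "response"
                and rec.get("WARC-Target-URI") == revisit.get("WARC-Refers-To-Target-URI")
                and rec.get("WARC-Date") == revisit.get("WARC-Refers-To-Date")):
            return rec
    return None


# Header values copied from the two records quoted in this comment.
STREAM = "https://t4.bcbits.com/stream/40e3bc944b3f75aaefe21592ac44129b/mp3-128/3055200501?p=0"
ORIGINAL_URI = (STREAM + "&ts=1702117656&t=e98d3aade08fc32e88c47764bc30293ecf6da025"
                "&token=1702117656_64a38ea3da9e4d05d198672b6e521d9ff830d3a5")
REVISIT_URI = (STREAM + "&ts=1702117684&t=d5d33d77c1776323faf72e4de8a464c5312130eb"
               "&token=1702117684_76afa40256625ac803069964cc5db9674826c5db")
HEX = "2059c52230d630aff5090d2964125eee31961c7b17bfe19cc9efe00861be82d1"

response = {
    "WARC-Type": "response",
    "WARC-Target-URI": ORIGINAL_URI,
    "WARC-Date": "2023-12-08T10:27:59.598Z",
    "WARC-Payload-Digest": "sha256:" + HEX,
}
revisit = {
    "WARC-Type": "revisit",
    "WARC-Target-URI": REVISIT_URI,
    "WARC-Date": "2023-12-08T10:28:30.630Z",
    # Same string as the response's WARC-Target-URI in the quoted records.
    "WARC-Refers-To-Target-URI": ORIGINAL_URI,
    "WARC-Refers-To-Date": "2023-12-08T10:27:59.598Z",
    "WARC-Payload-Digest": "sha-256:" + HEX,
}

original = find_original(revisit, [response])
same_raw = revisit["WARC-Payload-Digest"] == original["WARC-Payload-Digest"]
same_normalized = (digest_parts(revisit["WARC-Payload-Digest"])
                   == digest_parts(original["WARC-Payload-Digest"]))
print("original found:", original is not None)
print("digests equal as written:", same_raw)
print("digests equal ignoring label spelling:", same_normalized)
```

The sketch only handles hex digests with a `label:hex` layout as in the headers above; other encodings are out of scope.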
thx Mona
Hey folks, I think working revisit records are pretty important for pywb to properly replay. Otherwise we'll have to capture every page in a separate session. This bug severely limits the portability of web archives in between tools, like from AWP to pywb.
Hi folks, this issue is still unresolved. Therefore I am uploading another, maximally reduced example: 20180117_30-jahre-vor-gericht.warc.gz
It consists of only 3 captured URLs. One response with status 200 for this URL:
- https://cdn.repub.ch/s3/republik-assets/repos/republik/article-30-jahre-vor-gericht/images/e508169162623ea58b1069ff1e8c7128c15293b2.gif?size=850x567&format=auto&resize=333x
And two revisits with these URLs:
- https://cdn.repub.ch/s3/republik-assets/repos/republik/article-30-jahre-vor-gericht/images/e508169162623ea58b1069ff1e8c7128c15293b2.gif?size=850x567&format=auto&resize=665x
- https://cdn.repub.ch/s3/republik-assets/repos/republik/article-30-jahre-vor-gericht/images/e508169162623ea58b1069ff1e8c7128c15293b2.gif?size=850x567&format=auto&resize=850x
All three have the same digest: sha256:bb5fe7c707048130403185e469d1e080fe2a50a270902e3c02da374b3b809454
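To make the expected resolution explicit, here is a small sketch that maps each revisit in this reduced example to the response sharing its digest. The dictionaries are a simplified stand-in for index entries, with field names chosen for illustration rather than taken from pywb's CDXJ format. A digest-only lookup is an assumption of this sketch, not a description of pywb's resolving logic.

```python
from collections import defaultdict

# URLs and digest copied from the description of the reduced WARC.
BASE = ("https://cdn.repub.ch/s3/republik-assets/repos/republik/"
        "article-30-jahre-vor-gericht/images/"
        "e508169162623ea58b1069ff1e8c7128c15293b2.gif?size=850x567&format=auto&resize=")
DIGEST = "sha256:bb5fe7c707048130403185e469d1e080fe2a50a270902e3c02da374b3b809454"

captures = [
    {"url": BASE + "333x", "type": "response", "digest": DIGEST},
    {"url": BASE + "665x", "type": "revisit", "digest": DIGEST},
    {"url": BASE + "850x", "type": "revisit", "digest": DIGEST},
]


def resolve_revisits(entries):
    """Map each revisit URL to the URLs of responses with the same digest."""
    originals = defaultdict(list)
    for entry in entries:
        if entry["type"] == "response":
            originals[entry["digest"]].append(entry["url"])
    return {
        entry["url"]: originals.get(entry["digest"], [])
        for entry in entries
        if entry["type"] == "revisit"
    }


resolved = resolve_revisits(captures)
for revisit_url, original_urls in resolved.items():
    print(revisit_url[-4:], "->", [url[-4:] for url in original_urls])
```

Both revisits resolve to the `resize=333x` response, which lives under a different URL than either revisit.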
Still, pywb is not able to serve the object the revisits point to:
Traceback (most recent call last):
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.4-py3.6.egg\pywb\warcserver\handlers.py", line 160, in __call__
    out_headers, resp = loader(cdx, params)
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.4-py3.6.egg\pywb\warcserver\resource\responseloader.py", line 37, in __call__
    entry = self.load_resource(cdx, params)
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.4-py3.6.egg\pywb\warcserver\resource\responseloader.py", line 206, in load_resource
    local_index_query))
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.4-py3.6.egg\pywb\warcserver\resource\resolvingloader.py", line 89, in load_headers_and_payload
    cdx_loader)
  File "C:\Tools\pywb\lib\site-packages\pywb-2.7.4-py3.6.egg\pywb\warcserver\resource\resolvingloader.py", line 217, in _load_different_url_payload
    raise ArchiveLoadFailed(self.MISSING_REVISIT_MSG)
warcio.exceptions.ArchiveLoadFailed: Original for revisit record could not be loaded
NoneType: None
127.0.0.1 - - [2024-03-27 14:15:56] "POST /republik_bastel/resource/postreq?url=https%3A%2F%2Fcdn.repub.ch%2Fs3%2Frepublik-assets%2Frepos%2Frepublik%2Farticle-30-jahre-vor-gericht%2Fimages%2Fe508169162623ea58b1069ff1e8c7128c15293b2.gif%3Fsize%3D850x567%26format%3Dauto%26resize%3D850x&closest=20240327122236&matchType=exact HTTP/1.1" 503 442 0.008018