pywb
Old data & replay issue
Expected behavior
Hi,
I'm trying to set up an archive with an old (1996-2001) archive data collection (from IA), but I get errors like this:
{'args': {'coll': 'my-web-archive', 'type': 'replay', 'metadata': {}}, 'error': '{"message": "pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz: [Errno 2] No such file or directory: \'/usr/lib/python3.8/collections/my-web-archive/archive/pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz\'", "errors": {"WARCPathLoader": "pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz: [Errno 2] No such file or directory: \'/usr/lib/python3.8/collections/my-web-archive/archive/pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz\'"}}'}
Every requested URL gives the same error. CDX indexing also reports errors:
mw@webarch:~$ wb-manager cdx-convert collections/my-web-archive/indexes/
Convert 38 index files? (y/n)y
Converting collections/my-web-archive/indexes/pl-2001-EXTRACTION-20200922232618-00040-00049-ARC_arc.utf8.cdx -> collections/my-web-archive/indexes/pl-2001-EXTRACTION-20200922232618-00040-00049-ARC_arc.utf8.cdxj
Error: Invalid Url: http://www.amd.pl:8021349/21349d.html
With the original CDX files I can see the search results, but when I try to view the replay copy, I get an error.
I've tried to archive some current pages and made a test archive with the new WARC files and everything is working - so the pywb setup should be ok. Should I prepare the original files in some way?
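On the `Invalid Url` error from `cdx-convert` above: the port in that URL, 8021349, is larger than 65535, the highest valid TCP port, which is probably why the line is rejected. The following is a minimal sketch, not pywb's own validation, that uses only Python's standard library to flag URLs with an out-of-range port before conversion. The function name `has_valid_port` is made up for illustration.

```python
from urllib.parse import urlsplit


def has_valid_port(url):
    """Return False if the URL carries a port outside 0-65535 (or a non-numeric one)."""
    try:
        # Reading .port makes urllib parse and range-check the port;
        # it raises ValueError for values it cannot accept.
        urlsplit(url).port
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for url in (
        "http://www.amd.pl:8021349/21349d.html",  # URL from the error above
        "http://example.com:8080/",               # illustrative valid URL
    ):
        print(url, "ok" if has_valid_port(url) else "invalid port")
```

Lines with such URLs could then be inspected or filtered out of the CDX file by hand before running `wb-manager cdx-convert` again.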
Hi @mw0000,
Do you have a little more information on how the collection paths are set up on the filesystem? From your output above, it looks like PyWB is configured to look in /usr/lib/python3.8/collections/... whilst your collection is in your home directory, ~/collections/...
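One quick way to check this is a rough sketch like the one below. It assumes pywb resolves `collections/` relative to the directory it is started from, so it should be run from that same directory. The helper name `find_missing` is made up for illustration, and the filename is copied from the error message above. Note that the index entry includes a subdirectory, so the file has to exist at exactly that relative path under `archive/`.

```python
from pathlib import Path


def find_missing(archive_dir, filenames):
    """Return the names from `filenames` that are not files under `archive_dir`."""
    archive_dir = Path(archive_dir)
    return [name for name in filenames if not (archive_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed layout: run this from the directory where pywb is started.
    archive_dir = Path.cwd() / "collections" / "my-web-archive" / "archive"
    # Filename exactly as it appears in the error message above.
    wanted = [
        "pl-2001-EXTRACTION-20200922232618-00110-00119-ARC_arc/"
        "pl-2001-EXTRACTION-20200922232618-00110-ARC.arc.gz",
    ]
    print("looking in:", archive_dir)
    for name in find_missing(archive_dir, wanted):
        print("missing:", name)
```

If the printed directory is not the one that holds your ARC files, starting pywb from the parent of your `collections/` directory is the first thing to try.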
If you want to convert the ARC files to WARCs, it's a simple process: they can be converted with warcio.
pip install warcio
warcio recompress <source.arc.gz> <destination.warc.gz>
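With 38 index files you probably have a lot of ARC files, so here is a small sketch that runs the same `warcio recompress` command over a whole directory. It assumes the `warcio` command is on your PATH and that the files sit directly inside the directory you pass in. The helper names (`warc_name`, `recompress_command`, `convert_all`) are made up for illustration.

```python
import subprocess
from pathlib import Path


def warc_name(arc_path):
    """Map foo.arc.gz (or foo.arc) to foo.warc.gz in the same directory."""
    arc_path = Path(arc_path)
    name = arc_path.name
    for suffix in (".arc.gz", ".arc"):
        if name.endswith(suffix):
            return arc_path.with_name(name[: -len(suffix)] + ".warc.gz")
    raise ValueError(f"not an ARC file: {arc_path}")


def recompress_command(arc_path):
    """Build the `warcio recompress` command line for one file."""
    return ["warcio", "recompress", str(arc_path), str(warc_name(arc_path))]


def convert_all(archive_dir):
    """Run the conversion for every *.arc.gz directly inside archive_dir."""
    for arc_path in sorted(Path(archive_dir).glob("*.arc.gz")):
        # check=True stops at the first file warcio fails on.
        subprocess.run(recompress_command(arc_path), check=True)
```

Calling `convert_all("collections/my-web-archive/archive")` would write a `.warc.gz` next to each `.arc.gz`. The existing index would still point at the old `.arc.gz` names, so the collection would most likely need reindexing afterwards.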