cdxj-indexer
Recompress and Re-indexing Errors
We've run into two issues while trying to recompress and re-index some of our older ARCs.
1): When running warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz we get:
IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read as a WARC or ARC
Could anyone explain what is going on here, or suggest a possible workaround?
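In case it helps with diagnosis, here is a minimal sketch of how the file could be inspected without warcio, using only the Python standard library. It assumes that a readable ARC begins (after decompression) with a `filedesc://` version record and a WARC begins with `WARC/`; the function name `sniff_archive` and the `peek` parameter are our own, not part of warcio.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member


def sniff_archive(path, peek=64):
    """Return (is_gzip, kind, head) for the file at `path`.

    kind is "arc", "warc", "unknown", or "unreadable: <reason>".
    This is a hypothetical helper, not a warcio API.
    """
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC

    opener = gzip.open if is_gzip else open
    try:
        with opener(path, "rb") as fh:
            head = fh.read(peek)
    except (OSError, EOFError, zlib.error) as exc:
        # Truncated or corrupt gzip data ends up here.
        return is_gzip, "unreadable: %s" % exc, b""

    if head.startswith(b"WARC/"):
        kind = "warc"
    elif head.startswith(b"filedesc://"):
        kind = "arc"
    else:
        kind = "unknown"
    return is_gzip, kind, head
```

Running `sniff_archive("IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz")` and looking at the returned `head` bytes should at least show whether the file decompresses at all and whether the first record looks like an ARC header.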
2): For some of the ARCs that are successfully recompressed, we get this error when running cdxj-indexer on them:
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 403: ordinal not in range(128)
We've hand-checked a few of these ARCs, and the offending resource always appears to be a binary image. Any suggestions on how to move forward? I can also file the first error in the warcio repository if that's more appropriate.
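For reference, a minimal sketch of how this kind of error can arise. It assumes the failure happens when an index line containing a non-ASCII character (here `\xed`, i.e. `í`) is written to an output stream whose encoding is ASCII; we have not confirmed that this is where cdxj-indexer actually fails. The sample line and the `write_line` helper are made up for illustration.

```python
import io

# Hypothetical index line containing the character from the traceback.
line = "com,example)/im\xedgen.jpg 20041020093524 {}"


def write_line(text, encoding):
    """Write `text` through a text stream with the given encoding
    and return the bytes that were produced."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding)
    stream.write(text + "\n")
    stream.flush()
    return buf.getvalue()


try:
    write_line(line, "ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    # Same error class as in the report: 'ascii' codec can't encode '\xed'.
    ascii_ok = False
    ascii_error = str(exc)

# The same line is written without error through a UTF-8 stream.
utf8_bytes = write_line(line, "utf-8")
```

If the output is going to stdout under an ASCII locale, setting the environment variable `PYTHONIOENCODING=utf-8` before running the indexer may be worth trying, though we have not verified that it resolves this case.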