openwayback
Problem with arcIndexer
Hi all,
The ARC indexer throws an exception like "Created (escaped) uuri > 2083" and the indexing process stops. How can I solve this problem?
Thanks all
Can you provide more information about this issue? Sample file and command line to replicate the issue? Thanks!
Sorry, but I cannot send you the ARC file because I do not have access to it. It is strange: some links on the site that I harvested are longer than 2083 characters, and the process does not finish but throws an exception.
Managed to replicate this using the latest OpenWayback and the cdx-indexer script and this mocked-up ARC. Not sure how to attach files to GitHub(?). In any case, the error is:
org.apache.commons.httpclient.URIException: URI length > 2083
This seems to be getting thrown by:
org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)
In that case, what can I do? Remove the ARC?
I've just re-tested using a different ARC (updated the link above) with a record after the erroring one.
Although it throws the exception it does continue to index records afterwards - the indexing process doesn't stop. Are you finding otherwise? What version of OpenWayback are you using?
I am using version 1.2.1. I can tell that it is stopping because I call arc-indexer from a Java class using a process.
If you are using the code from Java, you will need to catch any runtime Exceptions thrown during the iteration over the records, so that you can recover and move on to the next record.
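A minimal sketch of that pattern, in plain Java with no OpenWayback classes. Everything here is hypothetical and only for illustration: `SkipBadRecords`, `keyIterator` and `indexAll` are made-up names, the iterator is a stand-in for whatever record iterator you are looping over, and the length limit is set to 20 just to keep the sample short. It also assumes the real iterator has already moved past the bad record when it throws; if it has not, the loop would need another way to skip it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SkipBadRecords {

    /** Hypothetical stand-in for a record iterator: next() throws on over-long URLs. */
    static Iterator<String> keyIterator(List<String> urls, int maxLength) {
        Iterator<String> source = urls.iterator();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public String next() {
                // Advance first, so a failing record is consumed and cannot be retried forever.
                String url = source.next();
                if (url.length() > maxLength) {
                    throw new IllegalStateException("URI length > " + maxLength);
                }
                return url.toLowerCase();
            }
        };
    }

    /** Drains the iterator, recording and skipping any record that fails. */
    static List<String> indexAll(Iterator<String> records, List<String> errors) {
        List<String> keys = new ArrayList<>();
        while (records.hasNext()) {
            try {
                keys.add(records.next());
            } catch (RuntimeException e) {
                // Note the failure and carry on with the next record.
                errors.add(e.getMessage());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        List<String> urls = Arrays.asList(
                "http://A.example/",
                "http://too-long.example/aaaaaaaaaaaaaaaa",
                "http://B.example/");
        List<String> keys = indexAll(keyIterator(urls, 20), errors);
        System.out.println("indexed: " + keys);
        System.out.println("skipped: " + errors);
    }
}
```

The try/catch sits inside the loop rather than around it, so one bad record costs only that record and not the rest of the file.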
Thanks for the info. I think we actually encounter the same problem sometimes. Is this something that can be recovered from? Perhaps the error should be caught by the ArcIndexer.