Problem with arcIndexer

Open rruizibai opened this issue 11 years ago • 8 comments

Hi all,

The ARC indexer throws an exception like "Created (escaped) uuri > 2083" and the indexing process stops. How can I solve this problem?

Thanks all

rruizibai avatar Dec 16 '13 08:12 rruizibai

Can you provide more information about this issue? Sample file and command line to replicate the issue? Thanks!

egh avatar Jan 09 '14 00:01 egh

Sorry, but I cannot send you the ARC file because I do not have access to it. It is strange: some links on the website I harvested are longer than 2083 characters, and the process does not finish but throws an exception.

rruizibai avatar Jan 09 '14 07:01 rruizibai

Managed to replicate this using the latest OpenWayback and the cdx-indexer script and this mocked-up ARC. Not sure how to attach files to GitHub(?). In any case, the error is:

org.apache.commons.httpclient.URIException: URI length > 2083

This seems to be getting thrown by:

org.archive.wayback.util.url.AggressiveUrlCanonicalizer.urlStringToKey(AggressiveUrlCanonicalizer.java:223)

PsypherPunk avatar Jan 09 '14 11:01 PsypherPunk

In that case, what can I do? Remove the ARC?

rruizibai avatar Jan 09 '14 11:01 rruizibai

I've just re-tested using a different ARC (updated the link above) with a record after the erroring one.

Although it throws the exception it does continue to index records afterwards - the indexing process doesn't stop. Are you finding otherwise? What version of OpenWayback are you using?

PsypherPunk avatar Jan 09 '14 13:01 PsypherPunk

I am using version 1.2.1. I noticed that it stops because I call arc-indexer from a Java class using a process.

rruizibai avatar Jan 09 '14 14:01 rruizibai

If you are using the code from Java, you will need to catch any runtime Exceptions thrown during the iteration over the records, so that you can recover and move on to the next record.
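
A minimal sketch of that pattern, not OpenWayback's actual API: the class `SkippingIndexer`, the method `indexAll`, and the stand-in canonicalizer `toyKey` are all hypothetical names made up for illustration, and the records are plain URL strings rather than real ARC records. It only shows the try/catch sitting inside the loop, so one bad record is noted and skipped instead of ending the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class SkippingIndexer {

    /** Outcome of one pass: keys that were produced and records that were skipped. */
    public static final class Result {
        public final List<String> keys = new ArrayList<>();
        public final List<String> skipped = new ArrayList<>();
    }

    /**
     * Applies toKey to every record. A RuntimeException from a single record
     * is caught inside the loop, so iteration moves on to the next record.
     */
    public static Result indexAll(Iterator<String> urls, Function<String, String> toKey) {
        Result result = new Result();
        while (urls.hasNext()) {
            String url = urls.next();
            try {
                result.keys.add(toKey.apply(url));
            } catch (RuntimeException e) {
                // Record the failure and carry on instead of aborting the run.
                result.skipped.add(url);
            }
        }
        return result;
    }

    /** Hypothetical stand-in for a canonicalizer that rejects over-long URLs. */
    public static String toyKey(String url) {
        if (url.length() > 2083) {
            throw new IllegalArgumentException("URI length > 2083");
        }
        return url.toLowerCase();
    }

    public static void main(String[] args) {
        // Build a URL longer than 2083 characters.
        String longUrl = "http://example.org/" + new String(new char[3000]).replace('\0', 'a');
        List<String> urls = Arrays.asList("http://Example.org/A", longUrl, "http://example.org/b");

        Result r = indexAll(urls.iterator(), SkippingIndexer::toyKey);
        System.out.println("indexed=" + r.keys.size() + " skipped=" + r.skipped.size());
    }
}
```

With a real record iterator the exception may come from the iterator itself rather than from the per-record work; in that case the call that advances the iterator would need to go inside the try block as well, and whether the iterator can continue after a failure depends on its implementation.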

anjackson avatar Jan 09 '14 18:01 anjackson

Thanks for the info. I think we actually encounter the same problem sometimes. Is this something that can be recovered from? Perhaps the error should be caught by the ArcIndexer.

egh avatar Jan 09 '14 18:01 egh