heritrix3
Bug in non-fatal-error log
While resolving #158 I noticed that the resulting entry in the non-fatal-error log contained the same stack trace twice. That is, a single exception produced the following:
2016-05-02T12:58:56.239Z 401 4758 http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu - - text/html #001 20160502125856064+163 sha1:KKGWJBFE2H4XPVTWXRNIMTCEQTJ4N76N - -
java.lang.IllegalStateException: Missing auth challenge headers for uri with response status 401: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu
at org.archive.modules.fetcher.FetchHTTP.extractChallenges(FetchHTTP.java:884)
at org.archive.modules.fetcher.FetchHTTP.handle401(FetchHTTP.java:802)
at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:743)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
java.lang.IllegalStateException: Missing auth challenge headers for uri with response status 401: http://aktravel.is/en/fundir-og-radstefnur/framkvaemd-radstefnu
at org.archive.modules.fetcher.FetchHTTP.extractChallenges(FetchHTTP.java:884)
at org.archive.modules.fetcher.FetchHTTP.handle401(FetchHTTP.java:802)
at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:743)
at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
at org.archive.modules.Processor.process(Processor.java:142)
at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
Probably a bug somewhere in
Hmm, looking at this more closely, this may actually be a bug in webarchive-commons, in
org.archive.io.GenerationFileHandler.publish()
((Preformatter)f).preformat(record);
super.publish(record);
It seems that both of these lines ultimately invoke NonFatalErrorFormatter.format(), whereas publish() should be reusing the string already prepared by preformat().
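To illustrate the suspected pattern, here is a minimal self-contained sketch built on plain java.util.logging. It is not the webarchive-commons code: CountingFormatter, PreformattingHandler and renderCount are hypothetical names invented for this example, and it only shows that a preformat-then-publish sequence runs the formatter twice unless format() reuses the preformatted string. It does not attempt to reproduce the exact mechanism that duplicates the stack trace.

```java
import java.io.ByteArrayOutputStream;
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.StreamHandler;

public class PreformatSketch {

    /** Hypothetical formatter that counts how often a record is actually rendered. */
    static class CountingFormatter extends Formatter {
        private final boolean reuse;   // if true, format() reuses the preformatted text
        int renderCalls = 0;
        private LogRecord cachedFor;
        private String cached;

        CountingFormatter(boolean reuse) {
            this.reuse = reuse;
        }

        /** Renders the record ahead of time and remembers the result. */
        void preformat(LogRecord record) {
            cached = render(record);
            cachedFor = record;
        }

        @Override
        public String format(LogRecord record) {
            if (reuse && record == cachedFor) {
                return cached;         // reuse the string prepared by preformat()
            }
            return render(record);     // otherwise the record is rendered a second time
        }

        private String render(LogRecord record) {
            renderCalls++;
            StringBuilder sb = new StringBuilder(record.getMessage()).append('\n');
            Throwable thrown = record.getThrown();
            if (thrown != null) {
                sb.append(thrown).append('\n');
            }
            return sb.toString();
        }
    }

    /** Hypothetical handler mirroring the preformat-then-publish sequence. */
    static class PreformattingHandler extends StreamHandler {
        PreformattingHandler(ByteArrayOutputStream out, CountingFormatter formatter) {
            super(out, formatter);
        }

        @Override
        public synchronized void publish(LogRecord record) {
            ((CountingFormatter) getFormatter()).preformat(record);
            super.publish(record);     // StreamHandler calls getFormatter().format(record)
        }
    }

    /** Publishes one record and reports how many times it was rendered. */
    static int renderCount(boolean reuse) {
        CountingFormatter formatter = new CountingFormatter(reuse);
        PreformattingHandler handler =
                new PreformattingHandler(new ByteArrayOutputStream(), formatter);
        LogRecord record = new LogRecord(Level.WARNING, "401 http://example.invalid/");
        record.setThrown(new IllegalStateException("Missing auth challenge headers"));
        handler.publish(record);
        handler.flush();
        return formatter.renderCalls;
    }

    public static void main(String[] args) {
        System.out.println("renders without reuse: " + renderCount(false));
        System.out.println("renders with reuse:    " + renderCount(true));
    }
}
```

In this sketch a single publish() renders the record twice when the cached string is ignored and once when format() returns it, which is the behaviour I would expect publish() to have.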
Also, I've confirmed that this bug was introduced between 3.0.0 and 3.1.0-RC1. It first shows up in our 2011-02 crawl which is when we switched from 3.0.0 to 3.1.0-RC1.