
Shutdown hangs when ExtractorHTML is stuck on big gnarly HTML

Open anjackson opened this issue 9 years ago • 4 comments

We've found we can't shut down H3 cleanly because a ToeThread is stuck in ExtractorHTML on a very large, poorly formed HTML file.

Is there anything we can do so that it at least shuts down cleanly? Here is the thread report:

[ToeThread #93: http://s152224197.websitehome.co.uk/other_secured_loans.php
 CrawlURI http://s152224197.websitehome.co.uk/other_secured_loans.php ILLLL http://s152224197.websitehome.co.uk/mortgage_guides.php    0 attempts
    in processor: extractorHtml
    ACTIVE for 2h24m30s338ms
    step: ABOUT_TO_BEGIN_PROCESSOR for 2h24m10s571ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    org.archive.util.InterruptibleCharSequence.charAt(InterruptibleCharSequence.java:41)
    java.util.regex.Pattern$SliceI.match(Pattern.java:3890)
    java.util.regex.Pattern$Curly.match1(Pattern.java:4185)
    java.util.regex.Pattern$Curly.match(Pattern.java:4134)
    java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    java.util.regex.Pattern$Curly.match2(Pattern.java:4209)
    java.util.regex.Pattern$Curly.match(Pattern.java:4136)
    java.util.regex.Pattern$SliceI.match(Pattern.java:3895)
    java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    java.util.regex.Pattern$Branch.match(Pattern.java:4502)
    java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    java.util.regex.Pattern$Start.match(Pattern.java:3408)
    java.util.regex.Matcher.search(Matcher.java:1199)
    java.util.regex.Matcher.find(Matcher.java:592)
    org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:810)
    org.archive.modules.extractor.ExtractorHTML.innerExtract(ExtractorHTML.java:743)
    org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
    org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
    org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
]

anjackson, Nov 02 '16 13:11

Invoking that ToeThread's kill() method should abort it fairly cleanly.

Edit: Best invoked via killThread(int threadNumber, boolean replace).

InterruptibleCharSequence should respond to the interrupt by raising a RuntimeException, ending the regex work.
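
To illustrate the mechanism, here is a minimal sketch of the idea, not the actual Heritrix source. `InterruptibleChars` and `firstHref` are hypothetical names made up for this example, and the sketch interrupts its own thread just to make the effect visible. It assumes only that the wrapper checks the thread's interrupt flag inside `charAt()`, which the regex engine calls constantly (as the stack trace above shows).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InterruptibleRegexSketch {

    /** Hypothetical stand-in for InterruptibleCharSequence; not the Heritrix code. */
    static final class InterruptibleChars implements CharSequence {
        private final CharSequence inner;

        InterruptibleChars(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads every character through charAt,
            // so polling the interrupt flag here lets a match be aborted.
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("interrupted during regex matching");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleChars(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns the first href value, "no match", or "aborted" if interrupted. */
    static String firstHref(CharSequence html, boolean interruptFirst) {
        Matcher m = Pattern.compile("href=\"([^\"]*)\"")
                .matcher(new InterruptibleChars(html));
        if (interruptFirst) {
            // Simulate another thread interrupting this one mid-extraction.
            Thread.currentThread().interrupt();
        }
        try {
            return m.find() ? m.group(1) : "no match";
        } catch (RuntimeException e) {
            return "aborted";
        } finally {
            Thread.interrupted(); // clear the flag so later work is unaffected
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"x.html\">x</a>";
        System.out.println(firstHref(html, false));
        System.out.println(firstHref(html, true));
    }
}
```

The point of the design is that `java.util.regex` itself never checks for interruption, so the only hook available is the character source it reads from.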

There used to be a GUI option for this in H1, but I think in H3 you need to use the scripting console.

kris-sigur, Nov 02 '16 14:11

Is killThread available from the H3 console? For example, is there a way to script killing all ToeThreads?

anjackson, Nov 08 '16 11:11

In the H3 console, I run this to get a report on toe threads:

job.crawlController.toePool.reportTo(rawOut);

and then kill the thread I want with:

tpool = job.crawlController.toePool;
tpool.killThread(10, true);

in case that is helpful...

ldko, Nov 08 '16 16:11

@ldko @kris-sigur Thank you! That gave me the information I needed to come up with this nasty little script...

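// Kill every live toe thread in the pool.
tpool = job.crawlController.toePool;

tpool.toes.each{
    // Slots in the toes collection can be null, so skip those.
    if( it != null) {
        rawOut.println("Killing toe thread " + it.serialNumber)
        // Second argument is the 'replace' flag of killThread(threadNumber, replace).
        tpool.killThread(it.serialNumber, false)
    }
}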

...which is just what I need! Thanks again.

anjackson, Nov 08 '16 17:11