Shutdown hangs when ExtractorHTML is stuck on big gnarly HTML
We've found that we can't shut down H3 cleanly because a ToeThread is stuck in ExtractorHTML on a very large, poorly formed HTML file.
Is there anything we can do so that the crawler at least shuts down cleanly?
[ToeThread #93: http://s152224197.websitehome.co.uk/other_secured_loans.php
CrawlURI http://s152224197.websitehome.co.uk/other_secured_loans.php ILLLL http://s152224197.websitehome.co.uk/mortgage_guides.php 0 attempts
in processor: extractorHtml
ACTIVE for 2h24m30s338ms
step: ABOUT_TO_BEGIN_PROCESSOR for 2h24m10s571ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
org.archive.util.InterruptibleCharSequence.charAt(InterruptibleCharSequence.java:41)
java.util.regex.Pattern$SliceI.match(Pattern.java:3890)
java.util.regex.Pattern$Curly.match1(Pattern.java:4185)
java.util.regex.Pattern$Curly.match(Pattern.java:4134)
java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
java.util.regex.Pattern$Curly.match2(Pattern.java:4209)
java.util.regex.Pattern$Curly.match(Pattern.java:4136)
java.util.regex.Pattern$SliceI.match(Pattern.java:3895)
java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
java.util.regex.Pattern$Branch.match(Pattern.java:4502)
java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
java.util.regex.Pattern$Start.match(Pattern.java:3408)
java.util.regex.Matcher.search(Matcher.java:1199)
java.util.regex.Matcher.find(Matcher.java:592)
org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:810)
org.archive.modules.extractor.ExtractorHTML.innerExtract(ExtractorHTML.java:743)
org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
org.archive.modules.Processor.innerProcessResult(Processor.java:175)
org.archive.modules.Processor.process(Processor.java:142)
org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
]
Invoking that ToeThread's kill() method should abort it fairly cleanly.
Edit: It is best invoked via the ToePool's killThread(int threadNumber, boolean replace).
InterruptibleCharSequence should respond to the interrupt by raising a RuntimeException, ending the regex work.
There used to be a GUI option for this in H1, but I think in H3 you need to use the scripting console.
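The interrupt mechanism described above can be sketched in plain Java. This is only an illustration of the general technique, not Heritrix's actual InterruptibleCharSequence: the names InterruptibleRegexSketch, InterruptibleText and findAbortsWhenInterrupted are hypothetical, and the exact exception and flag handling in Heritrix may differ. The idea is that the regex engine reads its input through charAt(), so a wrapper that checks the thread's interrupt flag there can end a long-running match.

```java
import java.util.regex.Pattern;

class InterruptibleRegexSketch {

    /** Hypothetical wrapper: delegates to the wrapped text, but refuses to
     *  hand out characters once the current thread has been interrupted. */
    static final class InterruptibleText implements CharSequence {
        private final CharSequence inner;

        InterruptibleText(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            // Assumption: an unchecked exception is enough to unwind the matcher.
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex aborted: thread was interrupted");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleText(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() on a worker thread whose interrupt flag is already set and
     *  reports whether the match was abandoned with a RuntimeException. */
    static boolean findAbortsWhenInterrupted(String text, String regex) {
        final Pattern pattern = Pattern.compile(regex);
        final boolean[] aborted = {false};
        Thread worker = new Thread(() -> {
            Thread.currentThread().interrupt(); // stands in for an external kill
            try {
                pattern.matcher(new InterruptibleText(text)).find();
            } catch (RuntimeException e) {
                aborted[0] = true;
            }
        });
        worker.start();
        try {
            worker.join(); // join() makes the worker's write to aborted[0] visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return aborted[0];
    }
}
```

For example, InterruptibleRegexSketch.findAbortsWhenInterrupted("<a href=\"x\">", "href") returns true, because the first charAt() call sees the interrupt flag and throws. Setting the flag before the match starts keeps the sketch deterministic; in a real crawl the interrupt would arrive from another thread partway through the match.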
Is killThread available from the H3 console? For example, is there a way to script killing all ToeThreads?
In the H3 console, I run this to get a report on toe threads:
job.crawlController.ToePool.reportTo(rawOut);
and then kill the thread I want with:
tpool = job.crawlController.ToePool;
tpool.killThread(10, true)
in case that is helpful...
@ldko @kris-sigur Thank you! That gave me the information I needed to come up with this nasty little script...
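// Kill every ToeThread in the pool, logging each one first.
tpool = job.crawlController.toePool;
tpool.toes.each {
    if (it != null) {
        rawOut.println("Killing toe thread " + it.serialNumber)
        // Second argument is 'replace'; false means no replacement thread is started.
        tpool.killThread(it.serialNumber, false)
    }
}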
...which is just what I need! Thanks again.