openwayback IndexWorker spins forever on flv record with incorrect content-type text/html

IndexWorker spins forever on (one particular) flv record with incorrect content-type text/html.

Sep 13 '14 02:09 nlevitt

Damn it, I can't attach a file to this issue. OK, here it is, hopefully: https://ia902302.us.archive.org/7/items/problem_201409/problem.warc (16 MB).

To reproduce, run org.archive.wayback.resourcestore.indexer.IndexWorker with one argument: the path to that WARC.

Sep 13 '14 02:09 nlevitt

I created a test case (not committed, as I'm not sure about the licensing of problem.warc, and it may be a bit big for a test file):

        IndexWorker iw = new IndexWorker();
        iw.setInterval(0);
        iw.init();

        CloseableIterator<CaptureSearchResult> itr = iw.indexFile("src/test/resources/problem.warc");
        CDXFormat cdxFormat = new CDXFormat(CDXFormatIndex.CDX_HEADER_MAGIC);
        Iterator<String> lines = 
            SearchResultToCDXFormatAdapter.adapt(itr, cdxFormat);
        System.out.println(CDXFormatIndex.CDX_HEADER_MAGIC);
        while(lines.hasNext()) {
            System.out.println(lines.next());
        }

Then used jstack to see what the stuck thread was up to:

"main" prio=5 tid=7fa5cb800800 nid=0x10d4e5000 runnable [10d4e3000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.lexer.Lexer.parseJsp(Lexer.java:1368)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:359)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:65)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:1)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorkerTest.testIndexFile(IndexWorkerTest.java:46)

It seems the HTML parser has managed to get itself into an infinite loop. TBH, I'm surprised to find that the CDX indexer is parsing the HTML at all (at least by default).

Sep 13 '14 07:09 anjackson

Looking deeper inside, we find:

        // Now the sticky part: If it looks like an HTML document, look for
        // robot meta tags:
        if(isHTML(mimeType)) {
            String fileContext = result.getFile() + ":" + result.getOffset();
            annotateHTMLContent(is, encoding, fileContext, result);
        }

So, it seems an FLV is being parsed as HTML because the Content-Type was wrong, and the HTML parser gets stuck on the binary data.

FWIW, when parsing HTML etc. we have found that we have to wrap any such parser in a Thread that we can kill after a time-out. So that's one option.
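A minimal sketch of that time-out option, using only java.util.concurrent. This is not code from OpenWayback: the class name TimeLimitedTask and the method callWithTimeout are hypothetical, and the task passed in stands in for whatever wraps the call to annotateHTMLContent.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of OpenWayback.
final class TimeLimitedTask {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns the fallback value if the task does not finish in time.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // Daemon thread: a task that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests interruption; the task only stops if it honours it.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and give up on the task.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

One caveat with this design: cancel(true) only interrupts the worker. A parser that spins in a tight loop without checking the interrupt flag or blocking on I/O keeps burning CPU on the abandoned daemon thread, so the indexer moves on but the thread is not actually reclaimed.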

Sep 13 '14 07:09 anjackson

@kngenie is working on the IA fork to improve content-type detection for link rewriting during playback. Maybe that logic can be used for this, too, when it's ready.

Sep 15 '14 17:09 nlevitt

The IA content-type detection that @kngenie was working on will be part of OpenWayback as of 2.1.0. The functionality introduced in "wayback-core/src/main/java/org/archive/wayback/replay/mimetype" might be used to solve this issue.
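Independently of that package (whose API is not shown in this thread), the idea of not trusting the declared Content-Type can be illustrated with a simple signature check. The sketch below is an assumption-laden example, not OpenWayback code: FlvSniffer and looksLikeFlv are hypothetical names, and it assumes the record's input stream supports mark/reset (wrap it in a BufferedInputStream if it does not). It relies on FLV files starting with the bytes 'F', 'L', 'V' followed by the version byte 0x01.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper, not part of OpenWayback.
final class FlvSniffer {

    // FLV signature: "FLV" followed by version 1.
    private static final byte[] FLV_MAGIC = {'F', 'L', 'V', 0x01};

    /**
     * Peeks at the first bytes of the stream and reports whether they match
     * the FLV signature. The stream position is restored before returning.
     */
    static boolean looksLikeFlv(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(FLV_MAGIC.length);
            try {
                byte[] head = new byte[FLV_MAGIC.length];
                int n = 0;
                // read() may return fewer bytes than requested, so loop.
                while (n < head.length) {
                    int r = in.read(head, n, head.length - n);
                    if (r < 0) {
                        break; // stream ended before a full signature
                    }
                    n += r;
                }
                return n == head.length && Arrays.equals(head, FLV_MAGIC);
            } finally {
                in.reset();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A check like this could guard the isHTML(mimeType) branch so that a record whose body starts with the FLV signature is never handed to the HTML parser, whatever its Content-Type header says. It only covers FLV; a general fix would need broader detection.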

Feb 03 '15 10:02 RogerMathisen
