boilerpipe
Extract article HTML from given HTML source?
Hi,
I know that the HTMLHighlighter can extract article HTML, but only from a
TextDocument or a URL.
I use HttpClient to retrieve the HTML, but I don't know how to construct a
TextDocument from it, or whether there is another way to extract the article HTML.
Please help!
Original issue reported on code.google.com by [email protected]
on 30 Nov 2012 at 8:44
Here is what I did:
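// Uses java.io.StringReader, since StringInputStream is not part of the JDK
ArticleExtractor EXTRACTOR = ArticleExtractor.getInstance();
HTMLHighlighter HH = HTMLHighlighter.newExtractingInstance();
// Parse the HTML string into a TextDocument
InputSource inputSource = new InputSource(new StringReader(html));
TextDocument htmlDoc = new BoilerpipeSAXInput(inputSource).getTextDocument();
// Classify the text blocks, then keep only the HTML of the article content
EXTRACTOR.process(htmlDoc);
html = HH.process(htmlDoc, html);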
Original comment by [email protected]
on 28 Jun 2013 at 8:21
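For anyone landing here later: the step that trips people up is turning an HTML string into an org.xml.sax.InputSource, and that part needs only the JDK. Below is a minimal sketch, assuming the page is already in memory as a String (for example, the response body fetched with HttpClient). The class HtmlInputSources and its methods are hypothetical names for illustration, not part of boilerpipe. Wrapping a Reader rather than a byte stream means the parser does not have to guess a character encoding.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of boilerpipe.
class HtmlInputSources {

    // Wrap an in-memory HTML string as a SAX InputSource backed by a character stream.
    static InputSource fromString(String html) {
        return new InputSource(new StringReader(html));
    }

    // Drain the character stream back into a String.
    // For checking only: this consumes the source, so build a fresh one for parsing.
    static String readBack(InputSource source) {
        Reader reader = source.getCharacterStream();
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        try {
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p></body></html>";
        InputSource source = fromString(html);
        // The InputSource carries the same characters that were passed in.
        System.out.println(readBack(source).equals(html));
    }
}
```

An InputSource built this way can be handed to new BoilerpipeSAXInput(inputSource) exactly as in the comment above.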