Provide optional extraction directives

Open bejean opened this issue 12 years ago • 3 comments

What about providing optional extraction directives?

In the majority of cases the extraction algorithm works great, but for some web sites it fails to extract the relevant content. For these sites, it should be possible to "help" snacktory focus on a specific part of the page by providing it with a Jsoup selector. For instance, we could have something like:

    ArticleTextExtractor extractor = new ArticleTextExtractor();
    extractor.setTextSelector("div.article_content");
    extractor.setTitleSelector("h2", "first");
    String dateRegEx = "xxxx";
    extractor.setDateSelector("#published", dateRegEx);

    JResult res = extractor.extractContent(rawData);
    text = res.getText();
    title = res.getTitle();
    date = res.getDate();

bejean avatar Oct 14 '12 10:10 bejean

Hmm, I don't find this solution that useful, as one could simply use jsoup directly for those failing sites. Also, I would rather adapt the core to handle the failing site. Let me think about it.

karussell avatar Oct 14 '12 12:10 karussell

Providing a scope to snacktory for the text extraction means running the snacktory algorithm within that scope. We still need the snacktory algorithm.
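
To make the "scope first, then extract" idea concrete, here is a minimal sketch using only the Java standard library. It is not snacktory code: `ScopedExtraction`, `extractWithinScope`, `between` and `stripTags` are hypothetical names invented for illustration. The scope function and the extractor are simple stand-ins; in practice the scope would presumably come from a Jsoup selector and the extractor would be the snacktory algorithm. The fallback to the whole page when the scope matches nothing is also an assumption, not something discussed above.

```java
import java.util.Optional;
import java.util.function.Function;

class ScopedExtraction {

    // Runs the extractor on the fragment chosen by the scope function,
    // falling back to the whole page when the scope matches nothing.
    static String extractWithinScope(String html,
                                     Function<String, Optional<String>> scope,
                                     Function<String, String> extractor) {
        String input = scope.apply(html).orElse(html);
        return extractor.apply(input);
    }

    // Stand-in scope (hypothetical): returns the text between two literal markers.
    // A real version would use a Jsoup selector instead of string search.
    static Optional<String> between(String html, String open, String close) {
        int start = html.indexOf(open);
        if (start < 0) return Optional.empty();
        int end = html.indexOf(close, start + open.length());
        if (end < 0) return Optional.empty();
        return Optional.of(html.substring(start + open.length(), end));
    }

    // Stand-in extractor (hypothetical): drops tags and collapses whitespace.
    // A real version would run the snacktory algorithm here.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<div id=\"nav\">Home | About</div>"
                    + "<div class=\"article_content\"><p>Body text</p></div>";

        // Scope matches: only the article fragment reaches the extractor.
        String scoped = extractWithinScope(
                page,
                h -> between(h, "<div class=\"article_content\">", "</div>"),
                ScopedExtraction::stripTags);
        System.out.println(scoped);

        // Scope does not match: the extractor sees the whole page.
        String unscoped = extractWithinScope(
                page,
                h -> between(h, "<div class=\"missing\">", "</div>"),
                ScopedExtraction::stripTags);
        System.out.println(unscoped);
    }
}
```

The point of the split is that the scope only narrows the input. The extraction algorithm itself is unchanged and still does the real work inside that scope.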

bejean avatar Oct 14 '12 13:10 bejean

I see what you mean!

karussell avatar Oct 15 '12 06:10 karussell