Feature request: support for schema.org/Article in ArticleExtractor

Open EgbertW opened this issue 10 years ago • 0 comments

(updated)

HTML5 introduces microdata by adding the attributes itemscope, itemid, itemtype, itemprop and itemref. These tags provide valuable information about the semantic role of the parts of a document. This information can also be very useful in parsing the contents of a website as the author intented, rather than by estimating their intent by using statistical or other heuristics.

An effort to standardize the value of these attributes is available on http://schema.org/ which defines various types of documents, such as Article: http://schema.org/Article

One example of a website that uses this effectively that I encountered is http://tweakers.net/. The ArticleExtractor itself does a poor job on this website as it does not only include the article text itself but also includes several (but not all) user comments.

In my setup, I have currently implemented this by first checking for the existence of any HTML elements with a itemprop=articleBody or itemprop=description attribute and using that text when available rather than invoking BoilerPipe, but it would be great if this knowledge could somehow be incorporated into a library such as BoilerPipe that focuses at extracting the article from such a HTML document.

Jul 13 '15 11:07 EgbertW