
Support HTML5 elements

Open GoogleCodeExporter opened this issue 9 years ago • 2 comments

Now that HTML5 is becoming more pervasive on the web, it might be worth
considering additional parsing support in places. One example is the recently
added image extractor: HTML5 includes <figure> and <figcaption> for adding
semantics to images. The figcaption element is of particular interest, since
its text could be used to determine an image's relevance to the extracted
document text.
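
To make the idea concrete, here is a minimal sketch of pairing each image with its caption. It is written in Python with the standard-library `html.parser` purely for illustration and is not boilerpipe code; the names `FigureCaptionParser` and `extract_figures` are hypothetical, and nested <figure> elements are not handled.

```python
from html.parser import HTMLParser


class FigureCaptionParser(HTMLParser):
    """Collects (img src, figcaption text) pairs from <figure> elements.

    Hypothetical helper for illustration; not part of boilerpipe.
    """

    def __init__(self):
        super().__init__()
        self.figures = []          # finished (src, caption) pairs
        self._in_figure = False    # currently inside a <figure>
        self._in_caption = False   # currently inside a <figcaption>
        self._src = None           # first <img src> seen in this figure
        self._caption_parts = []   # text fragments of the current caption

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._in_figure = True
            self._src = None
            self._caption_parts = []
        elif self._in_figure and tag == "img" and self._src is None:
            self._src = dict(attrs).get("src")
        elif self._in_figure and tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._in_figure:
            # Join fragments and collapse runs of whitespace.
            caption = " ".join("".join(self._caption_parts).split())
            self.figures.append((self._src, caption))
            self._in_figure = False

    def handle_data(self, data):
        if self._in_caption:
            self._caption_parts.append(data)


def extract_figures(html):
    """Return a list of (src, caption) tuples, one per <figure>."""
    parser = FigureCaptionParser()
    parser.feed(html)
    parser.close()
    return parser.figures
```

Images outside a <figure> are ignored here, so an extractor could treat captioned images as candidates and then compare the caption text against the extracted article text to judge relevance.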

Original issue reported on code.google.com by [email protected] on 18 Oct 2011 at 9:03

GoogleCodeExporter, Mar 24 '15 10:03

NAV, FOOTER, and HEADER should also help eliminate chunks of unwanted text.
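
A rough sketch of that filtering, assuming a plain tag-based pass in Python's standard-library `html.parser` (this is not how boilerpipe is implemented; `ChromeSkippingParser` and `text_without_chrome` are hypothetical names):

```python
from html.parser import HTMLParser

# Elements whose text is treated as page chrome in this sketch.
SKIPPED_TAGS = {"nav", "header", "footer"}


class ChromeSkippingParser(HTMLParser):
    """Collects text while ignoring anything inside nav/header/footer."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # how many skipped elements we are inside
        self.chunks = []       # kept text fragments, whitespace-normalized

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def text_without_chrome(html):
    """Return the document text with nav/header/footer content dropped."""
    parser = ChromeSkippingParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

One caveat: a <header> nested inside an <article> often carries the article title, so a real extractor would probably want to skip only page-level headers rather than every occurrence.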

Original comment by [email protected] on 15 Mar 2012 at 8:13

GoogleCodeExporter, Mar 24 '15 10:03

Sample HTML5 article with appropriate use of some of the tags mentioned above:
http://www.forbes.com/sites/forbestravelguide/2012/01/19/the-best-international-airports-for-layovers/

Original comment by [email protected] on 22 Mar 2012 at 8:53

GoogleCodeExporter, Mar 24 '15 10:03