boilerpipe
Support HTML5 elements
Now that HTML5 is becoming more pervasive on the web, it may be worth adding
parsing support for its elements in places, for example in the recently added
image extractor. HTML5 includes <figure> and <figcaption> for adding semantics
to images. The figcaption element is of particular interest, since its text
could be used to determine an image's relevance to the extracted document
text.
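As a rough illustration of the idea (not boilerpipe code), the sketch below pairs each image inside a <figure> with its <figcaption> text using Python's standard `html.parser`. The names `FigureCaptionParser`, `extract_figures`, and `caption_overlap` are hypothetical, and the word-overlap score is only an assumed stand-in for whatever relevance measure an image extractor might actually use.

```python
from html.parser import HTMLParser


class FigureCaptionParser(HTMLParser):
    """Collects (image src, caption text) pairs from <figure> elements."""

    def __init__(self):
        super().__init__()
        self.figures = []          # finished (src, caption) pairs
        self._in_figure = False
        self._in_caption = False
        self._src = None
        self._caption_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._in_figure = True
            self._src = None
            self._caption_parts = []
        elif tag == "img" and self._in_figure and self._src is None:
            # Only the first image of a figure is recorded in this sketch.
            self._src = dict(attrs).get("src")
        elif tag == "figcaption" and self._in_figure:
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._in_figure:
            # Collapse whitespace so nested inline tags do not leave gaps.
            caption = " ".join("".join(self._caption_parts).split())
            self.figures.append((self._src, caption))
            self._in_figure = False

    def handle_data(self, data):
        if self._in_caption:
            self._caption_parts.append(data)


def extract_figures(html):
    """Return a list of (src, caption) pairs for every <figure> in html."""
    parser = FigureCaptionParser()
    parser.feed(html)
    parser.close()
    return parser.figures


def caption_overlap(caption, document_text):
    """Assumed heuristic: fraction of caption words found in the text."""
    caption_words = set(caption.lower().split())
    if not caption_words:
        return 0.0
    text_words = set(document_text.lower().split())
    return len(caption_words & text_words) / len(caption_words)
```

Images outside a <figure> are ignored here; a real extractor would presumably keep its existing handling for those and use the caption only as an extra signal.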
Original issue reported on code.google.com by [email protected]
on 18 Oct 2011 at 9:03
NAV, FOOTER, and HEADER should also help eliminate chunks of unwanted text.
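A minimal sketch of that filtering, again with Python's standard `html.parser` and not boilerpipe's own API: text nested inside <nav>, <header>, or <footer> is dropped and everything else is kept. `ChromeSkippingParser` and `text_without_chrome` are hypothetical names. Note the assumption that every <header> is boilerplate; a <header> inside an <article> often holds the title, so a real filter would probably need to treat that case differently.

```python
from html.parser import HTMLParser

# Elements whose text is treated as page boilerplate in this sketch.
SKIPPED_TAGS = {"nav", "header", "footer"}


class ChromeSkippingParser(HTMLParser):
    """Collects text chunks that are not inside a skipped element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0   # > 0 while inside any skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_without_chrome(html):
    """Return the text chunks of html that lie outside nav/header/footer."""
    parser = ChromeSkippingParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```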
Original comment by [email protected]
on 15 Mar 2012 at 8:13
Sample HTML5 article with appropriate use of some of the tags mentioned above:
http://www.forbes.com/sites/forbestravelguide/2012/01/19/the-best-international-airports-for-layovers/
Original comment by [email protected]
on 22 Mar 2012 at 8:53