boilerpipe
Better support for non-English pages
I'm looking for a way to parse non-English pages, which give varying results
with boilerpipe. Here are a few examples where boilerpipe misses the main
portion of the text (tested with http://boilerpipe-web.appspot.com/ on
2011-01-06):
* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - picks up some teasers instead
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - all sorts of content from around the article
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - picks up the comment section
I also see minor artifacts from non-content sections throughout the extracted
text:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut"
("Print") is a link to print the article. "Bildmaterial" ("Images") is a
header from the sidebar.
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - "Dela med andra"
("Share with others") is a header from the sidebar with sharing links
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto -
Misses main header and teaser
I know it's hard to get all of the above URLs right without site-specific code,
but I also know it's possible. I've run all of them through readability.js, and
it parses every one without any artifacts. Maybe it's Readability's reliance on
class names (which are generally in English even on foreign-language sites)
that makes it cope better. The problem is that readability.js is a mess to run
server-side and has not undergone the rigorous testing boilerpipe has, so I
would much rather see boilerpipe succeed than switch to readability.js.
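To illustrate the class-name guess above, here is a minimal sketch of how
English hints in class/id attributes could nudge a block's score regardless of
the page's language. The hint lists, the function name `class_weight` and the
+/-25 adjustment are assumptions made up for this example; this is not
Readability's or boilerpipe's actual code.

```python
import re

# Hypothetical hint lists for illustration only; a real extractor would
# use its own, longer patterns.
POSITIVE = re.compile(r"article|body|content|entry|main|post|story|text", re.I)
NEGATIVE = re.compile(r"comment|footer|meta|promo|related|share|sidebar|widget", re.I)


def class_weight(class_and_id: str) -> int:
    """Return a score adjustment from English hints in class/id names."""
    weight = 0
    if POSITIVE.search(class_and_id):
        weight += 25  # looks like main content
    if NEGATIVE.search(class_and_id):
        weight -= 25  # looks like comments, sidebar, sharing widgets
    return weight


# The markup names are English even when the visible text is Swedish.
for name in ("article-body", "comments-section", "nyheter"):
    print(name, class_weight(name))
```

A name with no recognizable hint (such as "nyheter") gets no adjustment, so
text-based features would still have to decide in that case.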
Thanks for your hard work.
Original issue reported on code.google.com by EmilStenstrom
on 6 Jan 2011 at 2:43
[deleted comment]
Here's an update from 2011-12-08 on the above URLs, using the web version of
boilerpipe:
* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - Misses the header altogether (dn.se has had a new design since then...)
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - picks up some teasers instead of main text.
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - One teaser, and various text from popups
Minor artifacts:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut"
("Print") is a link to print the article. "Bildmaterial" ("Images") is a
header from the sidebar. "Dela" ("Share") at the bottom is from the sharing
feature
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - This one no
longer has any artifacts, well done!
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto -
Misses main header and teaser
I don't know what magic Readability uses, but all of the above URLs work
perfectly with Readability.
Original comment by EmilStenstrom
on 8 Dec 2011 at 9:08
http://www.anspress.com/index.php?a=2&cid=48&lng=az&nid=270848
Original comment by [email protected]
on 13 May 2014 at 1:44