Parse Issue: some sites use <br> for newlines but the parse doesn't add spaces between sentences
Issue by HodorTheCoder
Tue Sep 4 21:25:06 2018
Originally opened as https://github.com/codelucas/newspaper/issues/621
For example, from this article: https://abc7ny.com/15-year-old-girl-dies-after-5-story-fall-from-fire-escape/4134009/
A snippet of the HTML from the above article, which uses <br><br> to separate paragraphs:
A teenage girl died after she fell from the fire escape of an apartment building in Lower Manhattan late Sunday.<br><br>Police say 15-year-old Imogen Roche was attending a party inside an apartment building on Reade Street in Tribeca.<br> <div class="adRectangle-pos-small-inline" data-set="adAppend"></div> <br>It appears she left her cell phone in a room that was locked just before 11 p.m.<br><br>Authorities say Imogen went onto the fire escape, attempting to reenter the apartment by going in another window, when she lost her balance and fell.<br><br>
This translates to the following text when you parse it:
A teenage girl died after she fell from the fire escape of an apartment building in Lower Manhattan late Sunday.Police say 15-year-old Imogen Roche was attending a party inside an apartment building on Reade Street in Tribeca.It appears she left her cell phone in a room that was locked just before 11 p.m.Authorities say Imogen went onto the fire escape, attempting to reenter the apartment by going in another window, when she lost her balance and fell.
You can see that the sentences collide without spaces in the parsed output wherever a <br> appears: Sunday.Police, Tribeca.It, p.m.Authorities, etc.
This is due to poor markup on the news site, for sure, but it happens a ton actually. Any chance we could check for <br>'s after a period or sentence ending punctuation and add a space in the parsed output to distinguish sentences? It's hard for my NLP post processor to distinguish, for example, "Sunday.Police" as anything but one word because there are no spaces. Thanks. I would think this is an easy fix, right?
edit: title/HTML formatting
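A possible workaround until the parser handles this: pre-process the raw HTML and turn each <br> tag into a newline before the text is extracted, so neighbouring sentences keep some whitespace between them. This is only a minimal sketch, not newspaper's own behaviour. The helper name br_to_newlines is hypothetical, and it assumes you can fetch the HTML yourself and hand the cleaned string to the parser (for example through an option for supplying pre-downloaded HTML, if your version of the library has one).

```python
import re

# Matches <br>, <br/>, <br /> and <br ...attributes...>, case-insensitively.
# The \b stops it from matching longer tag names that merely start with "br".
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def br_to_newlines(html: str) -> str:
    """Hypothetical helper: replace every <br> tag with a newline character.

    The goal is only to leave whitespace where the line break was, so that
    text extraction does not glue "Sunday." and "Police" together.
    """
    return BR_TAG.sub("\n", html)


if __name__ == "__main__":
    snippet = (
        "A teenage girl died ... late Sunday.<br><br>"
        "Police say 15-year-old Imogen Roche was attending a party ..."
    )
    print(br_to_newlines(snippet))
```

Replacing the tag in the HTML, rather than patching the extracted text afterwards, avoids having to guess where sentences end (abbreviations like "p.m." make that guess unreliable).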
Comment by andho
Thu Nov 1 15:52:21 2018
Just for clarity, <br> is the correct way to insert a line break in HTML5. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/br
Comment by kut
Fri Mar 29 16:54:30 2019
Running into the same thing here. It also means we lose the paragraph information...