TextBlob
Sentence Boundary
I'm struggling with sentence boundaries in TextBlob. Below are a bunch of examples where the sentences are split incorrectly. Some of the other text-processing systems do better than TextBlob, but none of them are very good.
What is the best way to go about fixing this (happy to help)? Should I be working on Punkt / NLTK and letting the changes filter through to TextBlob, or is this something that can be done at the TextBlob level?
Thanks
import textblob as tb
# Case 1: no space after a full stop, and a final line without punctuation
sampletext = """Lets confuse the sentence finder in 2016.We will do this by not leaving a gap between sentences in 2016. If I leave a gap its OK in 2016.
If I have a new line its ok as well
"""
sampleblob = tb.TextBlob(sampletext)
print (sampleblob.sentences)
print ('-'*70)
# Case 2: heading-like lines with no terminal punctuation
sampletext = """Lets confuse the sentence finder with headings
Even If I Capitalise Differently
OR IF I USE ALL CAPS
Or if I leave heaps of lines
It thinks that its all one sentence
"""
sampleblob = tb.TextBlob(sampletext)
print (sampleblob.sentences)
print ('-'*70)
# Case 3: asterisk bullets, with and without terminal punctuation
sampletext = """Lets see if we can confuse the sentence finder:
* with bullets
* Should be OK if we have a full stop.
* Or a question mark ?
* but what if we leave off punctuation
* That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
"""
sampleblob = tb.TextBlob(sampletext)
print (sampleblob.sentences)
print ('-'*70)
# Case 4: roman-numeral list markers with a closing parenthesis
sampletext = """Lets see if we can confuse the sentence finder:
i) with bullets
ii) Should be OK if we have a full stop.
iii) Or a question mark ?
iv) but what if we leave off punctuation
v) That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
"""
sampleblob = tb.TextBlob(sampletext)
print (sampleblob.sentences)
print ('-'*70)
# Case 5: roman-numeral list markers with a full stop
sampletext = """Lets see if we can confuse the sentence finder:
i. with bullets
ii. Should be OK if we have a full stop.
iii. Or a question mark ?
iv. but what if we leave off punctuation
v. That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
"""
sampleblob = tb.TextBlob(sampletext)
print (sampleblob.sentences)
print ('-'*70)
# Case 6: numbered list markers with a full stop
sampletext = """Lets see if we can confuse the sentence finder:
1. with bullets
2. Should be OK if we have a full stop.
3. Or a question mark ?
4. but what if we leave off punctuation
5. That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
"""
sampleblob = tb.TextBlob(sampletext)
print (sampleblob.sentences)
I get you, but TextBlob directly uses the standard NLTK combination (Treebank word tokenizer + Punkt sentence tokenizer), so you'd be better off pushing changes there rather than here. A lot of what you want could be done by PunktWordTokenizer, but it's been made obsolete.
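In the meantime, one possible workaround at the TextBlob level is to pre-split the text yourself before it reaches the sentence tokenizer. The sketch below is only a heuristic under stated assumptions: `presplit` and both regexes are hypothetical helpers, not part of TextBlob or NLTK. It assumes every line break is a boundary, that list markers look like the ones in the examples above, and that a full stop directly followed by a capital letter is a missing space.

```python
import re

# Hypothetical helpers for illustration; not part of TextBlob or NLTK.
# A sentence-ending mark directly followed by a capital letter ("2016.We").
_MISSING_SPACE = re.compile(r"(?<=[a-z0-9][.!?])(?=[A-Z])")
# Leading list markers such as "* ", "- ", "1. ", "ii) " or "iv. ".
_LIST_MARKER = re.compile(r"^\s*(?:[*\-]|(?:\d+|[ivx]+)[.)])\s+", re.IGNORECASE)


def presplit(text):
    """Return line-level chunks, cleaned up for a sentence tokenizer."""
    chunks = []
    for line in text.splitlines():
        # Drop the bullet or numbering so it is not glued to the sentence.
        line = _LIST_MARKER.sub("", line).strip()
        if not line:
            continue  # skip blank lines
        # Re-insert the space that the tokenizer relies on.
        chunks.append(_MISSING_SPACE.sub(" ", line))
    return chunks
```

Each chunk could then be passed on its own, e.g. `tb.TextBlob(chunk).sentences`, so headings and unpunctuated bullets are no longer merged with the following line. It will misfire on abbreviations written without spaces, so treat it as a stopgap rather than a fix for Punkt itself.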