
Sentence Boundary

Open smalldatascaled opened this issue 8 years ago • 1 comments

I am struggling with sentence boundaries in TextBlob. Below are a number of examples where the sentences are split incorrectly. Some other text-processing systems do better than TextBlob, but none of them are very good.

What is the best way to approach fixing this (happy to help)? Should I work on Punkt / NLTK and let the changes filter through to TextBlob, or is this something that can be done at the TextBlob level?

Thanks

import textblob as tb

sampletext = """Lets confuse the sentence finder in 2016.We will do this by not leaving a gap between sentences in 2016. If I leave a gap its OK in 2016.
If I have a new line its ok as well
     """

sampleblob = tb.TextBlob(sampletext)
print(sampleblob.sentences)
print('-' * 70)

sampletext = """Lets confuse the sentence finder with headings

Even If I Capitalise Differently

OR IF I USE ALL CAPS







Or if I leave heaps of lines

It thinks that its all one sentence
     """

sampleblob = tb.TextBlob(sampletext)
print(sampleblob.sentences)
print('-' * 70)



sampletext = """Lets see if we can confuse the sentence finder:

     *  with bullets

     * Should be OK if we have a full stop.

     * Or a question mark ?

     * but what if we leave off punctuation

     * That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
     """

sampleblob = tb.TextBlob(sampletext)
print(sampleblob.sentences)
print('-' * 70)

sampletext = """Lets see if we can confuse the sentence finder:

     i)  with bullets

     ii) Should be OK if we have a full stop.

     iii) Or a question mark ?

     iv) but what if we leave off punctuation

     v) That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
     """

sampleblob = tb.TextBlob(sampletext)
print(sampleblob.sentences)
print('-' * 70)

sampletext = """Lets see if we can confuse the sentence finder:

     i.  with bullets

     ii. Should be OK if we have a full stop.

     iii. Or a question mark ?

     iv. but what if we leave off punctuation

     v. That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
     """

sampleblob = tb.TextBlob(sampletext)
print(sampleblob.sentences)
print('-' * 70)


sampletext = """Lets see if we can confuse the sentence finder:

     1.  with bullets

     2. Should be OK if we have a full stop.

     3. Or a question mark ?

     4. but what if we leave off punctuation

     5. That seems to confuse it. Having multiple sentences. In one bullet. Is not too bad.
     """

sampleblob = tb.TextBlob(sampletext)
print(sampleblob.sentences)

smalldatascaled, May 27 '16 01:05

I see what you mean, but TextBlob directly uses the standard NLTK combination (the Treebank word tokenizer plus the Punkt sentence tokenizer), so you would be better off pushing changes there rather than here. A lot of what you want could have been done with PunktWordTokenizer, but it has been made obsolete.

ghost, Feb 03 '17 07:02
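
For anyone hitting the same cases before an upstream fix lands, one possible workaround is to pre-process the text before it ever reaches the sentence tokenizer. The sketch below is only an illustration, not part of TextBlob or NLTK: the helper name `split_blocks` and both regular expressions are hypothetical, and the heuristics (treat every non-empty line as its own block, strip a leading bullet or list marker, insert a space where a sentence end runs straight into a capital letter) are assumptions that will misfire on some inputs such as URLs.

```python
import re

# Hypothetical helpers for illustration; not part of TextBlob or NLTK.

# Zero-width position between "x." / "6!" / "a?" and a following capital letter.
_MISSING_SPACE = re.compile(r"(?<=[a-z0-9][.!?])(?=[A-Z])")

# Leading list marker: "*", "-", "•", "1." / "1)", or a lowercase roman
# numeral such as "iv." / "iv)", followed by whitespace.
_BULLET = re.compile(r"^(?:[*\-•]|\d+[.)]|[ivx]+[.)])\s+")


def split_blocks(text):
    """Return one cleaned string per non-empty line of `text`.

    Each line is treated as a separate block, so headings and unpunctuated
    bullets cannot be merged with their neighbours by a later tokenizer.
    """
    blocks = []
    for raw in text.splitlines():
        line = _BULLET.sub("", raw.strip(), count=1)
        if line:
            # Repair "2016.We" style run-ons so the period is followed by a space.
            blocks.append(_MISSING_SPACE.sub(" ", line))
    return blocks


if __name__ == "__main__":
    sample = """Lets confuse the sentence finder in 2016.We will do this.

     *  with bullets
     iv) but what if we leave off punctuation
OR IF I USE ALL CAPS
"""
    for block in split_blocks(sample):
        print(repr(block))
```

Each returned block could then be passed on its own to `tb.TextBlob(block).sentences`, so that Punkt only has to find boundaries inside a single line. Whether line-per-block is the right granularity depends on the input; hard-wrapped prose, for example, would need the lines of a paragraph joined first.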