
UnicodeEncodeError in split_sentences

Open vetal4444 opened this issue 10 years ago • 11 comments

  s_iter = [''.join(map(str,y)).lstrip() for y in s_iter]

E UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 85: ordinal not in range(128)
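
For context, here is a minimal sketch of why this line fails. On Python 2, `str(x)` on a `unicode` value encodes it with the default codec, which is ASCII, so any character outside ASCII (here the em dash, `u'\u2014'`) raises this error. The snippet reproduces the same codec failure with an explicit `.encode('ascii')`, so it runs on Python 3 as well. `ascii_encode_error` and `sample` are illustrative names, not part of PyTeaser.

```python
# Sketch: make the implicit ASCII encode explicit (runs on Python 2.7 and 3).
EM_DASH = u'\u2014'  # the character named in the error message


def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by encoding text as ASCII, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc
    return None


sample = u'a sentence ' + EM_DASH + u' with a dash'
err = ascii_encode_error(sample)
print(err)        # an "'ascii' codec can't encode character" message
print(err.start)  # index of the offending character in sample
```

Replacing `str` with `unicode` in that `map` call avoids the implicit encode when the input is already Unicode text.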

vetal4444 avatar Jan 13 '15 16:01 vetal4444

I'm using the latest version from pip (pyteaser==1.0).

vetal4444 avatar Jan 13 '15 16:01 vetal4444

This was raised before and closed without being fixed. :( https://github.com/xiaoxu193/PyTeaser/issues/33

grimpunch avatar Feb 19 '15 10:02 grimpunch

@grimpunch Thought this was fixed with https://github.com/xiaoxu193/PyTeaser/pull/34

Apparently not. Will look into this personally

xiaoxu193 avatar Feb 19 '15 14:02 xiaoxu193

It seems pip has an old version. The code from master does not have this error.

vetal4444 avatar Feb 19 '15 14:02 vetal4444

My apologies, vetal4444 is correct. I'm using master now. Pip definitely has an old version

grimpunch avatar Feb 20 '15 16:02 grimpunch

@vetal4444 @grimpunch thank you guys for spotting the error!

  • Pip has been updated: https://pypi.python.org/pypi/pyteaser
  • README has been updated to reflect the change. https://github.com/xiaoxu193/PyTeaser/pull/43

xiaoxu193 avatar Mar 16 '15 05:03 xiaoxu193

I am still getting the same error. I tried encoding to UTF-8 and so on, but it's not working :(

harikt avatar Mar 23 '15 14:03 harikt

Can you post the link that you tried to run the algorithm on?

xiaoxu193 avatar Mar 23 '15 14:03 xiaoxu193

Sorry that I didn't thank you for the awesome work you have done. Thank you, dude.

Coming back to the problem:

The strange thing is that I installed pyteaser via pip and then updated it with pip install -U.

Earlier I was using a pyteaser.py file that I had simply copied into my folder, and that worked. Only the pip installation is failing. I am also new to Python; my background is PHP.

>>> from goose import Goose
>>> from pyteaser import Summarize
>>> g = Goose()
>>> page_url = "http://nikic.github.com/2012/06/29/PHP-solves-problems-Oh-and-you-can-program-with-it-too.html"
>>> try:
...     page = g.extract(page_url)
...     description = page.cleaned_text.encode('utf-8')
...     title = page.title
...     summarylist = Summarize(title, description)            
... except:
...     # Exception
...     print "Error occured in summary"
...     raise
... 
Error occured in summary
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 85, in Summarize
    sentences = split_sentences(text)
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 209, in split_sentences
    s_iter = [''.join(map(unicode,y)).lstrip() for y in s_iter]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
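
Note that this is a UnicodeDecodeError, not the UnicodeEncodeError from the original report. A likely cause, sketched below, is that `page.cleaned_text.encode('utf-8')` hands `Summarize` a byte string, and on Python 2 `unicode(byte_string)` decodes it with ASCII. The byte 0xe2 is the first byte of the UTF-8 encoding of characters such as the em dash (U+2014). `to_text` is a hypothetical helper, not part of PyTeaser or goose, and the snippet runs on Python 2.7 and 3.

```python
# Sketch: bytes vs. text. to_text is a hypothetical helper for illustration.
def to_text(value, encoding='utf-8'):
    """Return Unicode text, decoding byte strings explicitly instead of via ASCII."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


encoded = u'x\u2014y'.encode('utf-8')  # a byte string, like the encoded description

try:
    encoded.decode('ascii')            # what unicode(byte_string) does on Python 2
except UnicodeDecodeError as exc:
    print(exc)                         # reports byte 0xe2 in position 1

print(to_text(encoded) == u'x\u2014y')
```

If `page.cleaned_text` is already Unicode text (an assumption about goose's return value), dropping the `.encode('utf-8')` call and passing it to `Summarize` unchanged should avoid the implicit ASCII decode.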

harikt avatar Mar 23 '15 15:03 harikt

Hi @xiaoxu193,

I have a question: what about wrapping this in a try/except?

    another = ''
    for y in s_iter:
        try:
            another += ''.join(map(unicode,y)).lstrip()
        except:
            print "some way to catch"
    s_iter = [another]
    s_iter.append(sentences[-1])
    return s_iter

This is only pseudocode, though, and it didn't work :(

Just my thought.

Thank you.

harikt avatar Mar 31 '15 15:03 harikt

The problem occurs in the call to split(u'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+\"?[A-ZА-ЯЁ])', text, maxsplit=0, flags=REGEX_UNICODE).

harikt avatar Apr 02 '15 10:04 harikt
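
For anyone following along, here is a rough sketch of what a split with the pattern quoted above produces when the input is already Unicode text. Because the pattern has a capturing group, `re.split` returns the sentence bodies and their end punctuation as alternating items, which then have to be joined back in pairs. `split_sentences_sketch` and its pairing step are assumptions for illustration, not PyTeaser's actual implementation, and the snippet is written for Python 3, where `str` is always Unicode.

```python
import re

# Pattern taken from the comment above, written as a raw string.
SENTENCE_END = re.compile(r'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+"?[A-ZА-ЯЁ])', re.UNICODE)


def split_sentences_sketch(text):
    """Split Unicode text into sentences, keeping the end punctuation."""
    # parts alternates: body, punctuation, body, punctuation, ..., last body
    parts = SENTENCE_END.split(text)
    pairs = zip(parts[0:-1:2], parts[1::2])
    sentences = [(body + end).lstrip() for body, end in pairs]
    # The trailing piece has no captured punctuation after it.
    sentences.append(parts[-1].lstrip())
    return sentences


for sentence in split_sentences_sketch('It works \u2014 mostly. Does it fail? No.'):
    print(sentence)
```

With text input, no implicit ASCII conversion takes place, so the em dash passes through unchanged. The traceback earlier in the thread points at the `map(unicode, ...)` line rather than the split itself, which would be consistent with a byte string reaching that line.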