PyTeaser
UnicodeEncodeError in split_sentences
s_iter = [''.join(map(str,y)).lstrip() for y in s_iter]
E UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 85: ordinal not in range(128)
Using the latest version from pip (pyteaser==1.0).
This was raised before and closed without being fixed. :( https://github.com/xiaoxu193/PyTeaser/issues/33
@grimpunch Thought this was fixed with https://github.com/xiaoxu193/PyTeaser/pull/34
Apparently not. I will look into this personally.
It seems pip has an old version. The code from master does not have this error.
My apologies, vetal4444 is correct. I'm using master now. Pip definitely has an old version
@vetal4444 @grimpunch thank you guys for spotting the error!
- Pip has been updated: https://pypi.python.org/pypi/pyteaser
- README has been updated to reflect the change. https://github.com/xiaoxu193/PyTeaser/pull/43
I am still getting the same error. I tried encoding to UTF-8 etc., but it is not working :(.
Can you post the link that you tried to run the algorithm on?
Sorry that I didn't thank you for the awesome work you have done. Thank you, dude.
Coming back to the problem:
The strange thing is that I installed pyteaser via pip and updated it with pip install -U.
An earlier version used a pyteaser.py file that I had simply copied into my folder, and that worked. Only the pip installation is failing. I am also new to Python; my background is PHP.
>>> from goose import Goose
>>> from pyteaser import Summarize
>>> g = Goose()
>>> page_url = "http://nikic.github.com/2012/06/29/PHP-solves-problems-Oh-and-you-can-program-with-it-too.html"
>>> try:
...     page = g.extract(page_url)
...     description = page.cleaned_text.encode('utf-8')
...     title = page.title
...     summarylist = Summarize(title, description)
... except:
...     # Exception
...     print "Error occured in summary"
...     raise
...
Error occured in summary
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 85, in Summarize
    sentences = split_sentences(text)
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 209, in split_sentences
    s_iter = [''.join(map(unicode,y)).lstrip() for y in s_iter]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
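A note on this traceback: the snippet encodes cleaned_text to UTF-8 bytes before calling Summarize, and in Python 2 unicode() applied to a byte string decodes it with the ASCII codec by default, which is a likely cause of the UnicodeDecodeError (0xe2 is the first byte of the UTF-8 encoding of u'\u2014'). A minimal sketch of normalising the input to text first is below. The helper name to_text is hypothetical and is not part of PyTeaser; the sample string is made up for illustration.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return `value` as a text (unicode) string.

    Hypothetical helper, not part of PyTeaser. Byte strings are decoded
    explicitly instead of relying on the implicit ASCII decoding that
    Python 2 performs in unicode(some_bytes).
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


# Bytes such as those produced by .encode('utf-8') in the snippet above
raw = u'PHP solves problems \u2014 and you can program with it too.'.encode('utf-8')
text = to_text(raw)

# Already-decoded text passes through unchanged
same = to_text(u'plain text')
```

With a helper like this, the call could become Summarize(title, to_text(description)). Simply dropping the .encode('utf-8') step may have the same effect if cleaned_text is already a unicode string.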
Hi @xiaoxu193,
I have a question: what about wrapping this in a try/except?
another = ''
for y in s_iter:
    try:
        another += ''.join(map(unicode,y)).lstrip()
    except:
        print "some way to catch"
s_iter = [another]
s_iter.append(sentences[-1])
return s_iter
This is pseudocode, though, and it didn't work :(.
Just my thought.
Thank you.
The problem occurs in the call to split(u'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+\"?[A-ZА-ЯЁ])', text, maxsplit=0, flags=REGEX_UNICODE)
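For reference, here is a rough sketch of how a split with that pattern behaves when it is given text rather than bytes. It is not PyTeaser's actual implementation: the function name split_sentences_sketch is hypothetical, the pattern is written as a Python 3 raw string (on Python 2.7 it would need a ur'...' literal), and decoding byte input as UTF-8 is an assumption.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as quoted above, as a raw string (Python 3 syntax).
SENTENCE_END = re.compile(r'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+"?[A-ZА-ЯЁ])', re.UNICODE)


def split_sentences_sketch(text):
    """Split text into sentences; hypothetical sketch, not PyTeaser's code."""
    if isinstance(text, bytes):
        # Assumption: byte input is UTF-8. Decode once, up front.
        text = text.decode('utf-8')
    # With one capturing group, split() returns
    # [sentence, punctuation, sentence, punctuation, ..., last sentence].
    parts = SENTENCE_END.split(text)
    pairs = zip(parts[0:-1:2], parts[1::2])
    # Re-attach each sentence's closing punctuation and trim leading spaces.
    sentences = [(body + end).lstrip() for body, end in pairs]
    # Unlike the quoted line from split_sentences, the tail is also stripped.
    sentences.append(parts[-1].lstrip())
    return sentences


result = split_sentences_sketch(u'It works \u2014 mostly. Pip was stale! Use master.')
```

Because the pattern itself is a unicode string, the failure is more likely to come from the type of text passed in than from the regular expression, so decoding once at the boundary keeps everything downstream working on text.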