Mikhail Korobov
Czech-Latin2-err and Czech-Latin2 are also the same.
It would be interesting to plug the odict dictionary into pymorphy2 instead of opencorpora and see what happens, to compare :) The stress marks could be useful too.
Hi @DeaconDesperado , I don't have experience with nltk_contrib and textanalyzer.py, but "unicode" and "utf-8 text" are in some sense antonyms, not synonyms, because "utf8-encoded" means "binary". So what is...
I think the analyzer should only accept unicode text and leave the task of decoding to the user, and that almost every ".encode" / ".decode" is in the wrong place now :)...
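A minimal sketch of that split, in Python 3 terms, where "unicode" is `str` and encoded data is `bytes`. `count_words` is a hypothetical stand-in for an analyzer entry point, not a function from nltk_contrib:

```python
def count_words(text):
    """Hypothetical analyzer entry point: accepts decoded text only."""
    if isinstance(text, bytes):
        # Refuse encoded input instead of guessing its encoding.
        raise TypeError("expected str; decode bytes before calling")
    return len(text.split())


raw = "příliš žluťoučký kůň".encode("utf-8")  # bytes, as read from a file or socket
text = raw.decode("utf-8")                    # the caller decodes once, at the boundary
word_count = count_words(text)
```

With this layout the analyzer itself contains no `.encode` / `.decode` calls at all; the only decoding step lives in the caller, which is the one place that knows the source encoding.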
@starrify I believe the goal was indeed speed; also, these regexes may take e.g. only the first 4096 bytes of the page, ignoring the rest. Ideas about a proper solution are...
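To make the trade-off concrete, here is a rough sketch of a regex that only looks at the first 4096 bytes. The pattern and the name `sniff_meta_charset` are made up for illustration and are not the actual w3lib code:

```python
import re

# Hypothetical pattern: a charset declaration inside a <meta> tag.
_META_CHARSET_RE = re.compile(rb"""<meta[^>]*?charset\s*=\s*["']?([\w-]+)""", re.I)


def sniff_meta_charset(body, limit=4096):
    """Look for a declared charset in the first `limit` bytes only."""
    match = _META_CHARSET_RE.search(body[:limit])
    return match.group(1).decode("ascii") if match else None


early = b'<html><head><meta charset="utf-8"></head></html>'
# Same declaration, but pushed past the 4096-byte window by padding.
late = b"<html><head>" + b" " * 5000 + b'<meta charset="utf-8"></head></html>'

found_early = sniff_meta_charset(early)
found_late = sniff_meta_charset(late)
```

The scan stays cheap on large pages because the regex never sees more than `limit` bytes, but a declaration that appears later in the page, as in `late`, is missed.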
There is a stalled PR to address that: https://github.com/scrapy/w3lib/pull/25
@odinplus I wonder how this site works with Firefox, since according to @redapple's test Firefox doesn't encode `|` either.
It seems there is still no consensus between browsers on how to handle different characters in a URL path (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1064700). This means that a website which works in one browser may...
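As an illustration of the two forms a server may receive, here is a small sketch using the standard library's `urllib.parse.quote`. The path is invented, and this is not meant to reproduce what any particular browser or w3lib does:

```python
from urllib.parse import quote

path = "/search/a|b"  # made-up path containing a pipe character

strict = quote(path)               # default safe="/": '|' gets percent-encoded
lenient = quote(path, safe="/|")   # '|' declared safe: it passes through as-is

print(strict)   # /search/a%7Cb
print(lenient)  # /search/a|b
```

A server that treats `/search/a|b` and `/search/a%7Cb` as different resources will behave differently depending on which of these forms the client sends.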
In Python 2.x doctests just can't handle non-ASCII text. There are some bugs about that in the Python bug tracker, but as I recall they are all closed because the issue...
@akkatracker if you use the latest scrapely master in Python 3 it should print all characters correctly. Fixing it for Python 2.x could be ugly. Unicode _input_ issues are fixed by...