Mikhail Korobov comments

Results: 479 comments of Mikhail Korobov

remove Russian_Russky-UTF8~ from corpora/udhr.zip

Czech-Latin2-err and Czech-Latin2 are also the same.

Synchronization with odict.ru

It would be interesting to plug the odict dictionary into pymorphy2 in place of opencorpora, see what comes out, and compare :) The stress marks could be useful too.

Passing Unicode directly raises TypeError in textanalyzer

Hi @DeaconDesperado , I don't have experience with nltk_contrib and textanalyzer.py, but "unicode" and "utf-8 text" are in some sense antonyms, not synonyms, because "utf8-encoded" means "binary". So what is...

Passing Unicode directly raises TypeError in textanalyzer

I think that the analyzer should only accept unicode text and leave the task of decoding to the user, and that almost every ".encode" / ".decode" is in the wrong place now :)...
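
A minimal sketch of the "decode at the boundary" idea from the two comments above, written for Python 3 (where `str` is Unicode text and `bytes` is binary). The `analyze` function is a hypothetical stand-in, not the actual textanalyzer.py API:

```python
def analyze(text):
    """Hypothetical analyzer entry point: accepts decoded text only."""
    if not isinstance(text, str):
        # UTF-8 encoded input is binary data, so it is rejected here
        raise TypeError("expected str, got %s" % type(text).__name__)
    return text.split()


raw = "naïve café".encode("utf-8")  # bytes, e.g. as read from a file or socket

# The caller decodes once, at the edge of the program
words = analyze(raw.decode("utf-8"))

# Passing the undecoded bytes fails loudly instead of guessing an encoding
try:
    analyze(raw)
    rejected_bytes = False
except TypeError:
    rejected_bytes = True
```

With this split the analyzer itself needs no ".encode" / ".decode" calls; only the code that reads the input has to know the encoding.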

It's not a good idea to parse HTML text using regular expressions

@starrify I believe the goal was indeed speed; also, these regexes may take e.g. only the first 4096 bytes of the page, without the rest. Ideas about a proper solution are...

Pipe symbol ("|") is not percent encoded

There is a stalled PR to address that: https://github.com/scrapy/w3lib/pull/25

Pipe symbol ("|") is not percent encoded

@odinplus I wonder how this site works with Firefox, as according to @redapple's test Firefox doesn't encode `|` either.

Pipe symbol ("|") is not percent encoded

It seems there is still no consensus between browsers on how to handle different characters in a URL path (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1064700). This means that a website which works in one browser may...
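
For illustration only, here is how the two behaviours discussed in this thread look with the Python 3 standard library. This uses `urllib.parse.quote`, not w3lib's own functions, and the example path is made up:

```python
from urllib.parse import quote, unquote

path = "/search/a|b"  # hypothetical path containing a pipe

# By default only "/" is treated as safe, so "|" is percent-encoded
encoded = quote(path)

# Listing "|" as safe leaves it untouched, which is the behaviour
# the issue title describes
unencoded = quote(path, safe="/|")

# Both forms decode back to the same path
roundtrip = unquote(encoded)
```

Which form a given server accepts is up to the server, so a client may need to make this configurable rather than pick one.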

support CJK string annotation; print readably CJK string in scrapely.tool's output

In Python 2.x doctests just can't handle non-ascii text. There are some bugs about that in the Python bug tracker, but as I recall they are all closed because the issue...

support CJK string annotation; print readably CJK string in scrapely.tool's output

@akkatracker if you use the latest scrapely master in Python 3 it should print all characters correctly. Fixing it for Python 2.x could be ugly. Unicode _input_ issues are fixed by...