Elia Robyn Lake (Robyn Speer) comments

Results: 62 comments by Elia Robyn Lake (Robyn Speer)

No `mecab-ipadic-utf8` on centos 7, how can I use wordfreq on Japanese in this case?

Oh I get it: if we leave out the dictionary, MeCab will use whatever Japanese dictionary it prefers. Unfortunately, on Ubuntu, the default Japanese dictionary is mecab-jumandic-utf8, not mecab-ipadic-utf8. So...

No `mecab-ipadic-utf8` on centos 7, how can I use wordfreq on Japanese in this case?

Hello polm, thanks for the info here. The release of version 2.4.2 is, as you've noticed, intended to resolve the long-standing difficulty of dealing with mecab's dependencies. This bug can...

Feature: detect mixups between two single-byte encodings

That one's not an issue. Beneath the mojibake, that's exactly what the text says. `ï¿½` in Windows-1252 is 0xEF 0xBF 0xBD, the UTF-8 encoding of �, aka U+FFFD REPLACEMENT CHARACTER....
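The byte-level claim in this comment can be checked with the standard library alone. This is a minimal sketch, not code from the thread; the variable names are illustrative.

```python
# Encode U+FFFD REPLACEMENT CHARACTER as UTF-8: three bytes.
replacement_bytes = "\ufffd".encode("utf-8")

# Misread those same three bytes as Windows-1252 text.
mojibake = replacement_bytes.decode("windows-1252")

print(replacement_bytes.hex(" "))  # the raw bytes, in hex
print(mojibake)                    # the three-character mojibake
```

Seen this way, the three-character sequence is a faithful rendering of a replacement character that was already in the data, so there is no original text left to recover.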

Feature: detect mixups between two single-byte encodings

There are several open issues that are really the same thing. I'm merging them all into this issue.

Feature: detect mixups between two single-byte encodings

The examples are helpful! I can use them as test cases.

Feature: detect mixups between two single-byte encodings

Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with. I should, however, look into "the infamous CP850"...

Feature: detect mixups between two single-byte encodings

It's been encoded in UTF-8 and decoded in Windows-1250. Here's the code that specifically fixes it (written in a way that should work in Python 2 or 3): ``` >>>...
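The code in this comment is cut off in the listing, so the following is a separate, generic sketch of the same kind of repair rather than a reconstruction of it. It assumes a made-up sample word (not the text from the issue) and uses only the built-in codecs: text that was encoded as UTF-8 and decoded as Windows-1250 can usually be restored by reversing the two steps.

```python
# Hypothetical sample text; NOT the string from the original issue.
original = "příliš"

# Reproduce the mix-up: UTF-8 bytes misread as Windows-1250.
mojibake = original.encode("utf-8").decode("windows-1250")

# Reverse it: recover the bytes via Windows-1250, then decode them as UTF-8.
fixed = mojibake.encode("windows-1250").decode("utf-8")

print(mojibake)
print(fixed)
```

The round trip only works when every byte involved is defined in Windows-1250; a few byte values are unassigned there, and text containing them would raise a `UnicodeDecodeError` in the first step.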

Feature: detect mixups between two single-byte encodings

ftfy makes kind of arbitrary decisions about how to handle mixed encodings: it allows the encoding to change at line breaks, and it also decodes the most common mojibake sequences...
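To illustrate the idea of letting the encoding change at line breaks, here is a rough sketch that repairs each line independently. It is not ftfy's implementation: the function names `fix_line` and `fix_per_line` are hypothetical, and the only mix-up it considers is UTF-8 read as Windows-1252.

```python
def fix_line(line: str) -> str:
    """Try to undo UTF-8-read-as-Windows-1252 on a single line.

    If the line does not round-trip cleanly, assume it was already fine
    and return it unchanged.
    """
    try:
        return line.encode("windows-1252").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return line


def fix_per_line(text: str) -> str:
    """Apply fix_line to every line, so each line is judged on its own."""
    return "\n".join(fix_line(line) for line in text.split("\n"))


# First line is mojibake, second line is already correct.
mixed = "cafÃ©\nnaïve"
print(fix_per_line(mixed))
```

Pure-ASCII lines pass through unchanged, because ASCII bytes mean the same thing in both encodings.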

Feature: detect mixups between two single-byte encodings

@jpluimers I can't tell what the Unicode issue you're linking is -- what's the mojibaked text, and can you tell which encodings were mixed up?

Feature: detect mixups between two single-byte encodings

I'm trying to follow this: do you have a place where it says "v¾¾r" and it hasn't been flattened into "v3/43/4r"?