Elia Robyn Lake (Robyn Speer)

Results 62 comments of Elia Robyn Lake (Robyn Speer)

Oh I get it: if we leave out the dictionary, MeCab will use whatever Japanese dictionary it prefers. Unfortunately, on Ubuntu, the default Japanese dictionary is mecab-jumandic-utf8, not mecab-ipadic-utf8. So...

Hello polm, thanks for the info here. The release of version 2.4.2 is, as you've noticed, intended to resolve the long-standing difficulty of dealing with mecab's dependencies. This bug can...

That one's not an issue. Beneath the mojibake, that's exactly what the text says. `ï¿½` in Windows-1252 is 0xEF 0xBF 0xBD, the UTF-8 encoding of �, aka U+FFFD REPLACEMENT CHARACTER....
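
The byte-level claim in this comment can be checked with the standard library alone. The following is a minimal sketch, not code from the original thread; the variable names are illustrative.

```python
# U+FFFD REPLACEMENT CHARACTER, encoded as UTF-8, is the three bytes EF BF BD.
replacement_bytes = "\ufffd".encode("utf-8")
print(replacement_bytes.hex(" "))  # ef bf bd

# Misreading those three bytes as Windows-1252 produces three separate characters.
mojibake = replacement_bytes.decode("windows-1252")
print([f"U+{ord(ch):04X}" for ch in mojibake])  # ['U+00EF', 'U+00BF', 'U+00BD']

# Reversing the mistake recovers the replacement character, and nothing more:
# whatever text it originally stood in for is already lost.
restored = mojibake.encode("windows-1252").decode("utf-8")
print(restored == "\ufffd")  # True
```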

There are several open issues that are really the same thing. I'm merging them all into this issue.

The examples are helpful! I can use them as test cases.

Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with. I should, however, look into "the infamous CP850"...

It's been encoded in UTF-8 and decoded in Windows-1250. Here's the code that specifically fixes it (written in a way that should work in Python 2 or 3): ``` >>>...
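
The snippet in this comment is cut off, so what follows is not the original code. As a general illustration of the round trip the comment describes (UTF-8 bytes misread as Windows-1250), here is a minimal Python 3 sketch; the function name and the sample string are assumptions chosen for the example.

```python
def undo_utf8_read_as_cp1250(text: str) -> str:
    """Reverse the mix-up: get the original bytes back, then decode them as UTF-8.

    Raises UnicodeError if the text was not actually produced by this mix-up.
    """
    return text.encode("windows-1250").decode("utf-8")


# Hypothetical sample: "café" encoded as UTF-8, then wrongly decoded as Windows-1250.
garbled = "café".encode("utf-8").decode("windows-1250")
print(garbled)                            # cafĂ©
print(undo_utf8_read_as_cp1250(garbled))  # café
```

A strict encode/decode like this fails loudly on text that does not fit the pattern, which is usually preferable to silently altering text that was fine to begin with.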

ftfy makes kind of arbitrary decisions about how to handle mixed encodings: it allows the encoding to change at line breaks, and it also decodes the most common mojibake sequences...

@jpluimers I can't tell what the Unicode issue you're linking is -- what's the mojibaked text, and can you tell which encodings were mixed up?

Trying to follow this: do you have a place where it says "v¾¾r" and hasn't been flattened into "v3/43/4r"?