
Magika to detect text file encoding

mikacousin opened this issue 1 year ago · 5 comments

Hi, I was wondering if magika could be used in the future to detect the encoding of text files (utf-8, ascii, iso-8859-1, cp-1252, etc.), as this is not an easy task. Thanks

mikacousin avatar Feb 18 '24 15:02 mikacousin

You can try this: chardet
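For what it's worth, a minimal usage sketch (the file name is just a placeholder):

```python
import chardet

# Read the raw bytes; chardet guesses the encoding from the byte statistics.
with open("some_file.txt", "rb") as f:
    raw = f.read()

result = chardet.detect(raw)  # dict with 'encoding', 'confidence', 'language'
print(result["encoding"], result["confidence"])
```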

Byxs20 avatar Feb 19 '24 11:02 Byxs20

Thanks for the link. For the time being, I'm using charset_normalizer for this task. But I'm curious to know if another method is possible and more efficient.
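In case it helps for comparison, a minimal sketch of how I use it (the file name is a placeholder):

```python
from charset_normalizer import from_path

# from_path() returns candidate matches ranked by plausibility; best() picks the top one.
best = from_path("some_file.txt").best()
if best is not None:
    print(best.encoding)   # e.g. "utf_8" or "cp1252"
    print(str(best)[:80])  # content decoded to Unicode
```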

mikacousin avatar Feb 19 '24 11:02 mikacousin

> Thanks for the link. For the time being, I'm using charset_normalizer for this task. But I'm curious to know if another method is possible and more efficient.

Thanks for sharing. I'll try it sometime.

Byxs20 avatar Feb 19 '24 14:02 Byxs20

Thanks all! We would be interested in checking this out, but first we would try to understand whether existing approaches have problems (and which ones).

I'm wondering:

  • What is the approach used by existing tools? Heuristics? Trial and error?
  • Does anyone have instances of misdetections, or any intuition about why existing approaches could have problems?
  • Are existing problems more related to efficiency/performance than to accuracy?

reyammer avatar Feb 19 '24 18:02 reyammer

Hello @mikacousin!

The short answer is no: we're not extending Magika's scope beyond file-type detection for now. That's already a large enough scope for this project, and we prefer to do that job really well before considering any extension.

Long answer: The Magika approach is pretty much agnostic to content types. That means that, as long as you can collect a large labeled dataset of text files in various encodings, you could train a Magika model on it to detect encodings (a toy data-generation sketch follows the table below). Given what we've seen with Magika so far, I'd expect such a model to reach 99.xx% precision and recall. That said, the top-of-the-line stats that charset_normalizer reports already look great, so any improvement edge is likely to be marginal.

Package              Accuracy   Mean per file   Files per sec (est.)
charset-normalizer   98 %       10 ms           100 files/sec
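To make the labeled-dataset idea above concrete, here is a toy, hypothetical sketch of how training pairs could be generated by re-encoding known Unicode text; the sample strings and encoding list are illustrative only:

```python
# Hypothetical data-generation sketch: re-encode known Unicode text under
# several encodings to obtain (bytes, encoding-label) training pairs.
samples = ["Déjà vu, naïveté, café", "Şişli'de güneşli bir gün"]
encodings = ["utf-8", "cp1252", "iso-8859-9"]

dataset = []
for text in samples:
    for enc in encodings:
        try:
            dataset.append((text.encode(enc), enc))  # label = encoding used
        except UnicodeEncodeError:
            pass  # text not representable in this encoding; skip the pair

print(len(dataset), "labeled examples")
```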

An ML approach could potentially have two edges over solutions like charset_normalizer:

  • faster batch speed (through use of CPU features like this)
  • address the following stated limitation of charset_normalizer:

> Language detection is unreliable when text contains two or more languages sharing identical letters. (eg. HTML (english tags) + Turkish content (Sharing Latin characters))

That's because ML is generally context-aware: when properly trained, it could recognise the HTML structure (or other common formats) and give more weight to the Turkish content.
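As an illustration of that mixed-content case (the payload and encoding choice are made up for the example), one could feed English markup wrapping Turkish text to charset_normalizer and inspect both guesses:

```python
from charset_normalizer import from_bytes

# English HTML tags around Turkish body text, encoded as Turkish Latin-5.
payload = "<html><body><p>Güneşli bir gün</p></body></html>".encode("iso-8859-9")

best = from_bytes(payload).best()
if best is not None:
    # The encoding guess can be right while the language guess stays
    # ambiguous, which is the limitation quoted above.
    print(best.encoding, best.language)
```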

invernizzi avatar Feb 19 '24 18:02 invernizzi

Thank you for your detailed answer. Shall I close this issue?

mikacousin avatar Feb 20 '24 12:02 mikacousin

Closing this for now, but I've tagged it so that we keep it in mind in case this becomes relevant in the future. Thanks!

reyammer avatar Feb 20 '24 12:02 reyammer