Magika to detect text file encoding
Hi, I was wondering if Magika could be used in the future to detect the encoding of text files (UTF-8, ASCII, ISO-8859-1, CP-1252, etc.), as this is not an easy task. Thanks
You can try this: chardet
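For reference, a minimal sketch of what detection with chardet looks like (the file path is just a placeholder):

```python
# chardet.detect() returns a dict with the guessed encoding,
# a confidence score, and (for some detectors) a language.
import chardet

with open("some_text_file.txt", "rb") as f:  # placeholder path
    raw = f.read()

result = chardet.detect(raw)
print(result["encoding"], result["confidence"])
```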
Thanks for the link. For the time being, I'm using charset_normalizer for this task. But I'm curious to know if another method is possible and more efficient.
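In case it's useful for comparison, this is roughly how I'm calling charset_normalizer (a minimal sketch; the path is a placeholder):

```python
# from_path() analyses the file; best() returns the most plausible
# match (or None), which carries the guessed encoding.
from charset_normalizer import from_path

best_guess = from_path("some_text_file.txt").best()  # placeholder path
if best_guess is not None:
    print(best_guess.encoding)
```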
Thanks for sharing. I'll try it sometime.
Thanks all! We would be interested in checking this out, but first we would want to understand whether existing approaches have problems (and which ones).
I'm wondering:
- What is the approach used by existing tools? Heuristics? Trial and error?
- Does anyone have instances of misdetections or any intuition why existing approaches could have problems?
- Are existing problems more related to efficiency/performance rather than accuracy?
Hello @mikacousin!
The short answer is no, we're not extending Magika's scope beyond file type for now. That's already a large enough scope for this project at the moment, and we prefer to do that job really well before considering any extension.
Long answer: The Magika approach is pretty much agnostic to content types. That means that, as long as you can collect a large labeled dataset of text files with various encodings, you could train a Magika model on it to detect encodings. I'm expecting such a model to be at 99.xx% precision and recall, given what we've seen with Magika so far. That said, the top-of-the-line stats that charset_normalizer reports are already looking great, so any improvement edge is likely to be marginal.
| Package | Accuracy | Mean per file (ms) | File per sec (est) |
| --- | --- | --- | --- |
| charset-normalizer | 98 % | 10 ms | 100 file/sec |
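On the "collect a large labeled dataset" point: such a dataset could be bootstrapped by re-encoding existing UTF-8 text into several target charsets. A rough sketch (directory layout and function name are made up for illustration):

```python
# Hypothetical sketch: build (raw_bytes, encoding_label) pairs by re-encoding
# UTF-8 text files into several target charsets, as input for a
# Magika-style training pipeline.
from pathlib import Path

TARGET_ENCODINGS = ["utf-8", "ascii", "iso-8859-1", "cp1252"]

def build_labeled_samples(text_dir: str):
    samples = []  # list of (raw_bytes, encoding_label)
    for path in Path(text_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for enc in TARGET_ENCODINGS:
            try:
                samples.append((text.encode(enc), enc))
            except UnicodeEncodeError:
                continue  # text not representable in this charset; skip it
    return samples
```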
An ML approach could potentially have two edges on solutions like charset_normalizer:
- faster batch speed (through use of CPU features like this)
- address the following stated limitation of charset_normalizer:
  > Language detection is unreliable when text contains two or more languages sharing identical letters. (eg. HTML (english tags) + Turkish content (Sharing Latin characters))
That's because ML is generally context-aware: when properly trained, it could recognise the HTML structure (or other common formats) and give more weight to the Turkish content.
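To make the mixed-content case concrete, here is an illustrative probe (the sample string is made up, the exact output depends on the charset_normalizer version, and the `language` attribute name is my reading of its match object):

```python
# Probe the "English HTML tags + Turkish body" case quoted above.
from charset_normalizer import from_bytes

sample = "<html><body><p>Günaydın, nasılsınız?</p></body></html>".encode("cp1254")
match = from_bytes(sample).best()
if match is not None:
    # The language guess can be skewed by the ASCII-only English tags.
    print(match.encoding, match.language)
```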
Thank you for your detailed answer. Shall I close this issue?
Closing this for now, but I've tagged it so that we keep it in mind in case it becomes relevant in the future. Thanks!