tellenc
A program to detect the encoding of a text file.
Overview
Tellenc is a program to detect the encoding of a text file. Its usage is very simple:
tellenc [-v] <filename>
Exactly one file name must be provided. The ‘-v’ option makes tellenc generate verbose output, which may help the user understand how it works and provides clues about extending the program. It currently detects the following encodings:
- ASCII
- UTF-8
- UTF-16/32 (little-endian or big-endian)
- Latin1
- Windows-1250
- Windows-1252
- CP437
- GB2312
- GBK
- Big5
- SJIS
- EUC-JP
- EUC-KR
- KOI8-R
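For the Unicode entries in the list above, a common first check in encoding detectors is to look for a byte order mark (BOM) at the start of the data. The following is a minimal sketch of that general technique and is not taken from the tellenc source; the function name detect_bom and the returned strings are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of tellenc): returns the encoding implied
// by a leading byte order mark, or an empty string if no BOM is present.
std::string detect_bom(const std::string& data)
{
    const std::size_t n = data.size();
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(data.data());

    // UTF-32 must be tested before UTF-16, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return "utf-32le";
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return "utf-32be";
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "utf-8";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "utf-16le";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "utf-16be";
    return "";
}
```

A file without a BOM yields an empty result here, so a real detector needs further checks, such as the frequency analysis tellenc applies to legacy encodings.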
Extending tellenc
Extending this program should be easy. Here are the steps:
- Find some text representative of the language
- Save the text in the appropriate legacy encoding
- Run tellenc with the ‘-v’ option and the text file created above
- Look into the output and choose the double-bytes that appear in high frequency and are also unique (not already in freq_analysis_data in the source code)
- Add the value pair { code, encoding_name } to freq_analysis_data in the source code
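To make the idea behind the steps above concrete, here is a rough sketch of counting double-byte frequencies in a sample text and checking a candidate against an existing table. It is an illustration only and is not copied from tellenc.cpp: the names freq_entry, count_double_bytes and already_listed are hypothetical, the restriction to pairs of non-ASCII bytes is an assumption, and the actual layout of freq_analysis_data may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mirror of a { code, encoding_name } pair.
struct freq_entry {
    unsigned short code;      // two bytes packed as (first << 8) | second
    const char*    encoding;  // name of the encoding the pair points to
};

typedef std::pair<unsigned short, std::size_t> pair_count;

// Order by descending count; break ties by ascending code.
static bool more_frequent(const pair_count& a, const pair_count& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Count adjacent byte pairs in which both bytes are non-ASCII (>= 0x80)
// and return them with the most frequent pair first.
std::vector<pair_count> count_double_bytes(const std::string& text)
{
    std::map<unsigned short, std::size_t> counts;
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        unsigned char first  = static_cast<unsigned char>(text[i]);
        unsigned char second = static_cast<unsigned char>(text[i + 1]);
        if (first >= 0x80 && second >= 0x80)
            ++counts[static_cast<unsigned short>((first << 8) | second)];
    }
    std::vector<pair_count> result(counts.begin(), counts.end());
    std::sort(result.begin(), result.end(), more_frequent);
    return result;
}

// A candidate is only useful if its code is not already in the table.
bool already_listed(unsigned short code, const freq_entry* table, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (table[i].code == code)
            return true;
    return false;
}
```

In practice the verbose output of tellenc already provides the frequency information, so a helper like this is only needed to understand what the numbers mean.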
You are welcome to send me patches. Be sure to send me the test text file, too.
Building tellenc
Tellenc only requires a C++98-conformant compiler, and there are no other library dependencies. Here are a few possible command lines for different compilers.
MSVC (Windows):
cl /EHsc /Ox tellenc.cpp
GCC (Linux):
g++ -O2 tellenc.cpp -o tellenc -s
Clang (Mac):
clang++ -O2 tellenc.cpp -o tellenc
Previously I could get a very small executable with MSVC 6 + STLport 4.5.1:
cl /Ox /GX /Gr /G6 /MD /D_STLP_NO_IOSTREAMS tellenc.cpp /link /opt:nowin98
However, MSVC 6 is just too obsolete, and it does not accept the UTF-8 BOM character. I no longer maintain this build environment.
I can still get a quite small Windows executable with MSVC 7.1 + STLport 5.1.0 (size is less than half that of the executable generated by a more modern compiler, if the result only depends on system DLLs):
cl /Ox /GX /Gr /G7 /D_STLP_NO_IOSTREAMS tellenc.cpp /link /opt:nowin98
It probably does not matter, unless you like small sizes very much. :-)