
xnedit does not automatically recognize charset of a file

Open Robo8920 opened this issue 1 year ago • 2 comments

When opening a file I always have to manually select the correct charset. I have the impression that "nedit" did better on this.

Robo8920 avatar May 06 '24 14:05 Robo8920

Hello, I have some questions:

  1. What is the encoding of the file?
  2. What is your locale?
  3. Can you provide a test file that is not detected correctly?

Btw, nedit doesn't try to detect anything, because it doesn't support multiple charsets and only works if you are using a non-UTF-8 locale.

unixwork avatar May 06 '24 15:05 unixwork

In my case it's mainly UTF-8 and Latin-1 (ISO 8859-1). BTW, great work on your side providing XNedit.

Robo8920 avatar May 06 '24 16:05 Robo8920

Was this completed or discarded? I noticed that if I open an 8859-1 file, it detects errors, and I need to manually tell xnedit that it's 8859-1 and click on Reload.

cblc avatar Nov 15 '24 19:11 cblc

I got no feedback or instructions on how to reproduce it.

There is a function that tries to detect the encoding, however it does not always work.

unixwork avatar Nov 15 '24 20:11 unixwork

You are right, it's difficult to reproduce. I cannot make it fail now, but it sometimes does.

cblc avatar Nov 15 '24 20:11 cblc

One thing, though, is that for some reason the autodetection seems to prefer 8859-15 over 8859-1. I'd welcome some sort of user preference for changing your "preferred encoding when several match". All my old files are 8859-1, and I have to change the encoding manually every time I open one because they are detected as 8859-15.

cblc avatar Nov 30 '24 07:11 cblc

The encoding is chosen based on the country code from the locale environment variable.

I have now added a new nedit.fallbackCharset preference to the nedit.rc file. The default is locale, but you can set it to any encoding.
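
For example, to make 8859-1 the fallback, the entry in nedit.rc might look like the line below. This is only a sketch: it assumes the usual "nedit.name: value" syntax of that file, and the exact spelling of the charset name (ISO8859-1 here) is an assumption.

```
nedit.fallbackCharset: ISO8859-1
```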

unixwork avatar Nov 30 '24 09:11 unixwork

Thank you very much! I'm trying to understand how it works. My locale is set to es_ES.UTF-8, so I don't see why it falls back to 8859-15 when my locale is UTF-8. Looking at the DetectEncoding function in file.c, it seems to fall back to the default encoding if the file doesn't seem to be Unicode, so I don't know why it ends up with any of the 8859 flavours.

cblc avatar Dec 01 '24 18:12 cblc

In file.c, the function GetFileContent() contains this line (line 1017):

encoding = GetPrefDefaultCharset();

This gets the default encoding from the preferences. That preference is usually set to 'locale', in which case xnedit takes the charset from the locale, in your case UTF-8.

Later, DetectEncoding() is called with the def parameter set to "UTF-8". However, if the file is not UTF-8, there will probably be many encoding errors.

if(utf8Err == 0 || utf8Mb - utf8Err > 2) {
    return "UTF-8";
}

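Here utf8Mb is the number of valid UTF-8 multibyte characters and utf8Err is the number of invalid characters. So the file is treated as UTF-8 only if it has no errors at all, or if the valid multibyte characters outnumber the errors by more than two.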
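
To illustrate how such counters can be produced, here is a minimal sketch of a scan that feeds the same condition. It is not the code from file.c: the names Utf8Stats, count_utf8() and guess_utf8() are hypothetical, and the validation is simplified (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Hypothetical counters, mirroring utf8Mb and utf8Err from the thread. */
typedef struct {
    long mb;   /* valid multibyte UTF-8 sequences */
    long err;  /* bytes that cannot start or complete a sequence */
} Utf8Stats;

/* Scan a buffer and count valid multibyte sequences and invalid bytes. */
Utf8Stats count_utf8(const unsigned char *buf, size_t len) {
    Utf8Stats s = {0, 0};
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra; /* number of continuation bytes expected */
        if (c < 0x80) { i++; continue; }            /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else { s.err++; i++; continue; }            /* invalid lead byte */

        /* Count how many continuation bytes (10xxxxxx) actually follow. */
        size_t k = 1;
        while (k <= extra && i + k < len && (buf[i + k] & 0xC0) == 0x80) {
            k++;
        }
        if (k > extra) { s.mb++; i += extra + 1; }  /* complete sequence */
        else { s.err++; i++; }                      /* broken or truncated */
    }
    return s;
}

/* Apply the condition quoted above; NULL means "not UTF-8, use a fallback". */
const char *guess_utf8(const unsigned char *buf, size_t len) {
    Utf8Stats s = count_utf8(buf, len);
    if (s.err == 0 || s.mb - s.err > 2) {
        return "UTF-8";
    }
    return NULL;
}
```

With this sketch, a short 8859-1 text such as the bytes 'a', 0xF1, 'o' gives one error and no valid multibyte sequence, so the function returns NULL and a fallback encoding is needed.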

After that, GetDefaultEncoding() is called, although the function name is probably misleading. The function reads the locale variable without the UTF-8 part (just es_ES in your case) and compares it with the locales array, which contains the fallback encoding for each locale. This big array with all the defaults is at line 91 of file.c.
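
The lookup described above can be sketched roughly as follows. This is an illustration, not the actual code: LocaleFallback, fallback_table and fallback_for_locale() are hypothetical names, and the table entries are examples only (the es_ES entry matches what was reported in this thread), not a copy of the array in file.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: locale name without codeset -> fallback charset. */
typedef struct {
    const char *locale;
    const char *encoding;
} LocaleFallback;

/* Illustrative entries only; the real array in file.c is much larger. */
static const LocaleFallback fallback_table[] = {
    {"es_ES", "ISO8859-15"},
    {"ru_RU", "KOI8-R"},
};

/* Strip the codeset/modifier (".UTF-8", "@euro") and look up the rest. */
const char *fallback_for_locale(const char *locale) {
    if (locale == NULL) {
        return NULL;
    }
    size_t n = strcspn(locale, ".@"); /* length of the "es_ES" part */
    size_t count = sizeof fallback_table / sizeof fallback_table[0];
    for (size_t i = 0; i < count; i++) {
        const char *name = fallback_table[i].locale;
        if (strlen(name) == n && strncmp(name, locale, n) == 0) {
            return fallback_table[i].encoding;
        }
    }
    return NULL; /* no entry for this locale */
}
```

Called with "es_ES.UTF-8", this sketch compares only "es_ES" against the table, which is why a UTF-8 locale can still end up with an 8859 fallback.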

unixwork avatar Dec 01 '24 18:12 unixwork

Aah! That clears it up! Thanks a lot!!

cblc avatar Dec 01 '24 21:12 cblc