[BUG] Wrong file length in the statusbar
Description of the Issue
After selecting the proper single-byte codepage in the menu, the internal representation of the data becomes Unicode, so the displayed length is now greater than the real file length (up to 2x).
It also breaks the HEX plugin, which now displays garbage instead of the raw file data.
Related: #16468 #15919
Steps To Reproduce
- Open a text file in a single-byte encoding
- Select another single-byte codepage.
- See the wrong length: the data is now Unicode internally
Current Behavior
Length doubled
Expected Behavior
Correct length
Debug Information
Notepad++ v8.8.1 (64-bit)
Build time : May 3 2025 - 18:41:09
Scintilla/Lexilla included : 5.5.6/5.4.4
Boost Regex included : 1_85
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line : "C:\Program Files\Notepad++\change.log"
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : OFF
Placeholders : OFF
Scintilla Rendering Mode : SC_TECHNOLOGY_DIRECTWRITE (1)
Multi-instance Mode : monoInst
File Status Auto-Detection : cdEnabledNew (for current file/tab only)
Dark Mode : OFF
OS Name : Windows 7 Professional (64-bit)
OS Build : 7601.0
Current ANSI codepage : 1251
Plugins :
BigFiles (0.1.3)
ComparePlugin (2.0.2)
helloworld (1)
HexEditor (0.9.12)
JSMinNPP (1.2205)
MarkdownViewerPlusPlus (0.8.2)
mimeTools (3.1)
NppConverter (4.6)
nppcrypt (1.0.1.6)
NppExport (0.4)
NppFTP (0.29.9)
NppQrCode64 (0.0.0.1)
ShtirlitzNppPlugin (1.1.2)
XMLTools (3.1.1.13)
Anything else?
No response
A duplicate of:
- https://github.com/notepad-plus-plus/notepad-plus-plus/issues/14210
No, it's not. That user works in UTF-8 and his length is correct, while I work in a single-byte codepage, where the length must be the plain raw byte count.
@u07 Maybe provide some actual data, so others can try to reproduce?
Sure, here. Actual codepage is Cyrillic -> 866
The Length is 73 bytes (which, by a happy and incredible coincidence, is rendered into 73 letters).
@u07
What I see while debugging is that your 73-byte-long ANSI file is transformed into UTF-8 data 117 bytes long (when you use Encoding > Character sets > Cyrillic > OEM 866):
- you will end up in this place (`fileFormat._encoding` is 866 there): https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/PowerEditor/src/ScintillaComponent/Buffer.cpp#L1848-L1853
- when I then step into the `wmc.encode` call there, I see: https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/PowerEditor/src/MISC/Common/Common.h#L97-L101
- it means that the data is first transformed/mapped to UTF-16 (`char2wchar` uses the MultiByteToWideChar WinAPI for that), but then converted inside `wchar2char` not back to one-byte-per-char OEM 866 but to UTF-8 (CP 65001):
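That expansion can be reproduced outside N++ with Python's codec machinery. This is only a sketch of the same ANSI -> UTF-16 -> UTF-8 round trip, not the actual N++ code path: every Cyrillic letter is one byte in OEM 866 but two bytes in UTF-8, which is exactly how 73 source bytes can grow to 117.

```python
s = "Привет"  # 6 Cyrillic letters

oem866 = s.encode("cp866")     # on-disk single-byte form: 1 byte per letter
utf16 = s.encode("utf-16-le")  # what the MultiByteToWideChar-style widening yields
utf8 = s.encode("utf-8")       # what Scintilla's UTF-8 document sees (CP 65001)

print(len(oem866), len(utf16), len(utf8))  # 6 12 12
```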
So what you see (117) is the number of bytes representing your OEM 866 data encoded for the Scintilla UTF-8 editor-control representation. I've already pointed out that problem: https://github.com/notepad-plus-plus/notepad-plus-plus/pull/15346#issuecomment-2191932829 and got this answer:
So IMO the Length: info should be recalculated to real char-count.
It's arguable. But "Length:" on the status bar should, for me, be the file length
rather than the character count, so that a user who opens a file knows the file length.
If the user needs to know the character count, he/she can select the whole document
by hitting Ctrl-A; then "Sel:" will show the character count.
That's about what I expected to be happening, because Scintilla works with either ANSI or UTF, and definitely not with 866. Thanks for the analysis, xomx. As for the length: while it may be a matter of discussion whether we should show the byte length or the character length, it is absolutely certain that the value must reflect the actual file properties and not some internal intermediate data representation. Shouldn't it?
the value must reflect the actual file properties and not some internal intermediate data representation
It should be like that, at least I feel the same way (which is why I asked the N++ author in that 15346 comment before).
Shouldn't it?
Yes (even if I previously thought no, for performance reasons; I will explain later).
The Length: status-bar item is not the only one affected by this issue; check also the behavior of Pos:, which is likewise byte- and not char-oriented (move the cursor from the start of your test file data to just before the first non-ASCII Cyrillic char, i.e. Col: 19 & Pos: 19, then move past that char and you will see Col: 20 & Pos: 21...)
This is UTF-8, where characters can have different lengths in bytes (the max for Unicode is 4 B, though the original UTF-8 design allowed up to 6 bytes!). So to switch Length: and Pos: to a more user-intuitive character representation, you have to count all the Scintilla characters from a safe byte offset (e.g. from the file start, offset 0 B), where you know you are not in the middle of a multi-byte UTF-8 character. And here I previously stopped, because I saw the big performance difference between SCI_GETTEXTLENGTH/SCI_GETLENGTH (which quickly return the already known, stored document byte length):
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.h#L519
and SCI_COUNTCHARACTERS, which is a "heavyweight opponent":
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L1805-L1815
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L810
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L871
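For reference, what makes SCI_COUNTCHARACTERS expensive is that it boils down to a linear scan of the byte buffer. A minimal Python sketch (not Scintilla's actual position-iteration code) counts a character at every byte that is not a UTF-8 continuation byte:

```python
def count_utf8_chars(data: bytes) -> int:
    # Every codepoint starts at a byte that is not 0b10xxxxxx, so
    # counting non-continuation bytes counts characters. This is O(n)
    # in the buffer size, unlike the cached SCI_GETLENGTH byte count.
    return sum((b & 0xC0) != 0x80 for b in data)

data = "Col: Привет".encode("utf-8")      # 5 ASCII + 6 Cyrillic chars
print(len(data), count_utf8_chars(data))  # 17 11
```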
So I thought it was a no-go (taking into account that users nowadays open even GB-sized files in N++, and considering that every document edit would then require such a status-bar recalculation...).
But now I've got an idea: the "safe starting byte offset" in a UTF-8 file can also be a line-break offset, and that is a game changer, as we can run the heavyweight SCI_COUNTCHARACTERS only from the last known file line No. for Length: ... SCI_GETLINECOUNT
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L2533-L2535
and from the current line No. for Pos: ... SCI_LINEFROMPOSITION
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Editor.cxx#L6442-L6445
Implementation of the above (only for the Notepad_plus::updateStatusBar() visual info) should be easy.
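A toy single-threaded sketch of that idea (hypothetical helper names, nothing from the N++ sources): since the byte 0x0A never appears inside a multi-byte UTF-8 sequence, the position right after any line break is guaranteed to be a character boundary, so only the tail after the last line break needs the expensive scan. A real implementation would cache the character counts of the preceding lines instead of recounting them as done here.

```python
def count_utf8_chars(data: bytes) -> int:
    # characters start at every non-continuation byte (not 0b10xxxxxx)
    return sum((b & 0xC0) != 0x80 for b in data)

def char_pos(data: bytes, byte_pos: int) -> int:
    # 0x0A is plain ASCII and can never be a UTF-8 continuation byte,
    # so the byte right after a '\n' is a safe restart boundary.
    line_start = data.rfind(b"\n", 0, byte_pos) + 1
    # Recounted here only to keep the sketch self-contained; the idea
    # above is to have this value cached per line on the Scintilla side.
    chars_before = count_utf8_chars(data[:line_start])
    return chars_before + count_utf8_chars(data[line_start:byte_pos])

doc = "первая строка\nвторая".encode("utf-8")  # 20 chars, 38 bytes
print(char_pos(doc, len(doc)))                 # 20
```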
Edit: Found a related issue: #9095
But now I've got an idea
Implementation of the above (only for the Notepad_plus::updateStatusBar() visual info) should be easy.
Forget it, it will not work (calculating the UTF-8 chars in multiple parallel "one-line-only" threads): it's too complicated (and still too slow for really big files), and only for a small gain here.
So what remains as a possibility here is to somehow make it clearer in the N++ UI that, when non-ANSI chars occur in the doc, Length: is the N++ tab buffer length in bytes of the Scintilla UTF-8 document representation (and, in the case of Pos:, a byte offset into it). Tooltip time? Changing Pos: to Offs: and Length: to Length (B):?
Found another way of counting in chars, a native Scintilla one and relatively simple; now I need to measure its overall performance impact. If OK, I will create a testing PR preview.
One question: with the EOL set in N++ to Windows CR LF, should it be counted as one char or two? E.g. what is your expectation when you cross an end of line in an open doc: should the Pos: in chars be incremented by one or by two? (When counting in file-representation bytes, the answer is simple: by two.)
Right now it is counting all the invisible chars like emoji-style modifiers, etc., but it does not count the BOM, although that looks like the same kind of thing...
Akelpad counts EOL as two
does not count the BOM, although that looks like the same kind of thing...
N++ always strips any possible BOM before loading file data into Scintilla buffer (m_nSkip):
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/PowerEditor/src/Utf8_16.cpp#L216-L234
IMO that's the right thing; I don't know of any Windows text editor that counts the file BOM as part of the file data.
Right now it is counting all the invisible chars like emoji-style modifiers
Yes, currently it is counting in file-representation bytes, so that is expected.
Akelpad counts EOL as two
Ok. I'm inclined to that too. After all, they really are two valid (although not directly visible) chars even in UTF-8.
@xomx IMO CRLF should be counted as two.