[BUG] Wrong file length in the statusbar

Open u07 opened this issue 7 months ago • 12 comments

Description of the Issue

After selecting the proper 1-byte codepage in the menu, the internal representation of the data becomes Unicode, so the displayed length is now twice what it actually is.

It also breaks the HEX plugin, which now displays garbage instead of the raw file data.

Related: #16468 #15919

Steps To Reproduce

  1. Open a 1-byte txt file.

Image

  2. Select another 1-byte codepage.

Image

  3. See the wrong length: the data is now Unicode internally.

Image

Current Behavior

Length doubled

Expected Behavior

Correct length

Debug Information

Notepad++ v8.8.1   (64-bit)
Build time : May  3 2025 - 18:41:09
Scintilla/Lexilla included : 5.5.6/5.4.4
Boost Regex included : 1_85
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line : "C:\Program Files\Notepad++\change.log" 
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : OFF
Placeholders : OFF
Scintilla Rendering Mode : SC_TECHNOLOGY_DIRECTWRITE (1)
Multi-instance Mode : monoInst
File Status Auto-Detection : cdEnabledNew (for current file/tab only)
Dark Mode : OFF
OS Name : Windows 7 Professional (64-bit)
OS Build : 7601.0
Current ANSI codepage : 1251
Plugins : 
    BigFiles (0.1.3)
    ComparePlugin (2.0.2)
    helloworld (1)
    HexEditor (0.9.12)
    JSMinNPP (1.2205)
    MarkdownViewerPlusPlus (0.8.2)
    mimeTools (3.1)
    NppConverter (4.6)
    nppcrypt (1.0.1.6)
    NppExport (0.4)
    NppFTP (0.29.9)
    NppQrCode64 (0.0.0.1)
    ShtirlitzNppPlugin (1.1.2)
    XMLTools (3.1.1.13)

Anything else?

No response

u07 avatar May 22 '25 06:05 u07

A duplicate of:

  • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/14210

freezer2022 avatar May 22 '25 07:05 freezer2022

No, it's not. That user works in UTF-8 and his length is correct, while I work with a 1-byte encoding, so the length must be the plain raw byte count.

u07 avatar May 22 '25 07:05 u07

@u07 Maybe provide some actual data, so others can try to reproduce?

alankilborn avatar May 22 '25 10:05 alankilborn

Here is the data.txt

Sure, here. Actual codepage is Cyrillic -> 866

The Length is 73 bytes (which, by a happy and incredible coincidence, is rendered into 73 letters).

u07 avatar May 22 '25 10:05 u07

@u07

What I see while debugging is that your 73-byte ANSI file is transformed into 117 bytes of UTF-8 data (when you use Encoding > Character sets > Cyrillic > OEM 866):

  • you will end up in this place (fileFormat._encoding is 866 there): https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/PowerEditor/src/ScintillaComponent/Buffer.cpp#L1848-L1853

  • when I then step there into the wmc.encode call, I see: https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/PowerEditor/src/MISC/Common/Common.h#L97-L101

  • it means that the data is first mapped to UTF-16 (char2wchar uses the MultiByteToWideChar WinAPI for that), but is then converted not back to one-byte-per-char OEM 866 but to UTF-8 (CP 65001) inside wchar2char:

    Image
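The round trip described in the list above can be reproduced outside N++. This is a minimal sketch in Python, assuming its built-in `cp866` codec matches the Windows OEM 866 code page for these characters. The sample text is hypothetical (it is not the attached data.txt); it is one combination, 29 ASCII plus 44 Cyrillic characters, that gives the same 73 and 117 byte counts:

```python
# Hypothetical sample, NOT the reporter's data.txt: 29 ASCII + 44 Cyrillic characters.
text = "a" * 29 + "я" * 44

raw = text.encode("cp866")        # what is stored on disk: one byte per character
wide = raw.decode("cp866")        # step 1: code page 866 -> Unicode (N++ goes through UTF-16 here)
in_editor = wide.encode("utf-8")  # step 2: Unicode -> UTF-8, the form the editor buffer holds

print(len(raw))        # 73
print(len(in_editor))  # 117
```

Each Cyrillic letter takes one byte in OEM 866 but two bytes in UTF-8, which is where the extra 44 bytes come from in this sample.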

So what you see (117) is the number of bytes representing your OEM866 data encoded for Scintilla UTF-8 editor control representation. I've already pointed to that problem: https://github.com/notepad-plus-plus/notepad-plus-plus/pull/15346#issuecomment-2191932829 and got this answer:

    So IMO the Length: info should be recalculated to real char-count.

It's arguable. But "Length:" on status bar for me should be the file length rather than the character count, so when a user opens a file he/she knows the file length. If the user needs to know the character count, he/she can select the whole document by hitting Ctrl-A, then "Sel:" will show the character count.

xomx avatar May 22 '25 17:05 xomx

That's about what I expected to be happening, because Scintilla works with either ANSI or UTF-8 and definitely not 866. Thanks for the analysis, xomx. Speaking about length: while it may be a matter of discussion whether we should calculate byte length or symbol length, it is absolutely certain that the value must reflect the actual file properties and not some internal intermediate data representation. Shouldn't it?

u07 avatar May 22 '25 18:05 u07

the value must reflect the actual file properties and not some internal intermediate data representation

It should be like that, at least I feel the same way (which is why I asked the N++ author in that 15346 comment before).

Shouldn't it?

Yes (even if I previously thought - no, for performance reasons - I will explain later).

The length: status-bar item is not the only one affected in this issue. Check also the behavior of Pos:, which is also a byte-oriented and not a char-oriented item (move the cursor from the start of your test file to just before the first non-ANSI Cyrillic char, i.e. Col: 19 & Pos: 19, then move to the next char and you will see Col: 20 & Pos: 21).
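The jump of Pos: from 19 to 21 is the two-byte UTF-8 encoding of a single Cyrillic letter. A small sketch in Python, using a hypothetical line (18 ASCII characters followed by Cyrillic text, not the real test file) and assuming that both Col: and Pos: are displayed 1-based, as the numbers above suggest:

```python
line = "0123456789abcdefgh" + "привет"  # hypothetical: 18 ASCII chars, then Cyrillic

def col_and_pos(chars_before_caret: int) -> tuple[int, int]:
    """Return (Col, Pos) as 1-based values: Col counts characters, Pos counts UTF-8 bytes."""
    prefix = line[:chars_before_caret]
    return chars_before_caret + 1, len(prefix.encode("utf-8")) + 1

print(col_and_pos(18))  # (19, 19): caret just before the first Cyrillic letter
print(col_and_pos(19))  # (20, 21): one character further, two bytes further
```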

This is UTF-8, where characters can have different lengths in bytes (the max for Unicode is 4 bytes, although the original UTF-8 spec allowed up to 6 bytes!). So to switch length: and Pos: to a more user-intuitive character representation, you have to count all the Scintilla characters from a safe byte offset (e.g. from the file start, offset 0), where you know that you are not in the middle of a UTF-8 char. And that is where I previously stopped, because I saw the big difference in computing cost between Scintilla's SCI_GETTEXTLENGTH/SCI_GETLENGTH (which quickly return the already stored document length in bytes):

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.h#L519

and SCI_COUNTCHARACTERS, which is a "heavyweight opponent":

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L1805-L1815

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L810

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L871

So I thought it was a no-go (taking into account that users nowadays open even GB-sized files in N++, and that every document edit would then require recalculating these status-bar items...).
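The cost difference comes from the fact that a byte length is a stored number, whereas a character count has to inspect the bytes. A rough illustration in Python (this is not Scintilla's code, only the general technique of skipping UTF-8 continuation bytes; `count_utf8_chars` is a made-up helper name):

```python
def count_utf8_chars(buf: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in buf if (b & 0xC0) != 0x80)

buf = "abc привет €".encode("utf-8")

print(len(buf))               # 20: byte length, already known to the container
print(count_utf8_chars(buf))  # 12: character count, needs a pass over every byte
```

For a multi-GB buffer that pass would have to be repeated, or cleverly cached, after every edit.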

But now I've got an idea that the "safe starting byte offset" in a UTF-8 file could also be a line-break offset, and that is a game changer, as we can use the heavyweight SCI_COUNTCHARACTERS only from the last known file line No. for length: ... SCI_GETLINECOUNT

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Document.cxx#L2533-L2535

and the current line No. for Pos: ... SCI_LINEFROMPOSITION

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/scintilla/src/Editor.cxx#L6442-L6445

Implementation of the above (only for the Notepad_plus::updateStatusBar() visual info) should be easy.
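Why a line break is a safe starting offset: in UTF-8 every byte of a multi-byte sequence has its high bit set, so a `\n` byte (0x0A) can never sit in the middle of a character. A sketch in Python of counting characters only from the start of the current line (`chars_from_line_start` is a hypothetical helper, and `pos` is assumed to be on a character boundary):

```python
def chars_from_line_start(buf: bytes, pos: int) -> int:
    """Characters between the start of the line containing byte offset pos and pos itself."""
    line_start = buf.rfind(b"\n", 0, pos) + 1  # 0 when there is no earlier line break
    segment = buf[line_start:pos]
    return sum(1 for b in segment if (b & 0xC0) != 0x80)

buf = "первая\nвторая строка".encode("utf-8")
pos = len("первая\nвтор".encode("utf-8"))  # byte offset 21, inside the second line

print(chars_from_line_start(buf, pos))  # 4
```

This only yields the count within the current line. The character totals of all preceding lines would still have to come from somewhere, which is where the cost remains.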

Edit: Found a related issue: #9095

xomx avatar May 23 '25 01:05 xomx

But now I've got an idea

Implementation of the above (only for the Notepad_plus::updateStatusBar() visual info) should be easy.

Forget it, it will not work (calculating the UTF-8 chars in multiple parallel "one-line-only-threads"), it's too complicated (and still too slow for really big files) and only for a small gain here.


So what remains as a possibility here is to somehow make it clearer in the N++ UI that, when non-ANSI chars occur in the doc, length: is the N++ tab buffer length in bytes of the Scintilla UTF-8 document representation (and, in the case of Pos:, a byte offset in it). Tooltip time? Changing Pos: to Offs: and length: to length (B):?

xomx avatar May 23 '25 17:05 xomx

Found another way of counting in chars, a native Scintilla one and relatively simple; now I need to measure its overall performance impact. If it is OK, I will create a testing PR preview.

One question: with EOL set in N++ to Windows CRLF, should it be taken as one char or two? E.g. what is your expectation when you cross an end of line in an opened doc: should Pos: in chars be incremented by one or by two? (When counting in file-representation bytes, the answer is simple: by two.)
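To make the two options concrete, here is a tiny Python sketch with a hypothetical two-line document using CRLF line endings:

```python
doc = "one\r\ntwo\r\n"  # hypothetical document, Windows CRLF line endings

crlf_as_two = len(doc)                        # CR and LF each counted as a character
crlf_as_one = len(doc.replace("\r\n", "\n"))  # each CRLF pair counted as a single character

print(crlf_as_two)  # 10
print(crlf_as_one)  # 8
```

The difference between the two conventions grows by one per line, so it is visible even in short files.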

xomx avatar May 26 '25 10:05 xomx

Right meow it is counting all the invisible chars like emoji style modificators, etc but does not count BOM although it looks like a stuff of same kind...

Akelpad counts EOL as two

u07 avatar May 26 '25 10:05 u07

does not count BOM although it looks like a stuff of same kind...

N++ always strips any possible BOM before loading file data into Scintilla buffer (m_nSkip):

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/a10cebe2cd99ea39a0258365c5c4585599471c58/PowerEditor/src/Utf8_16.cpp#L216-L234

IMO that is the right thing; IDK any Windows text editor that counts the file BOM as part of the file data.
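The effect of stripping the BOM on a byte-based length can be illustrated in Python. This is only an analogy for the skip described above, not the N++ code, and the sample content is hypothetical:

```python
import codecs

on_disk = codecs.BOM_UTF8 + "привет".encode("utf-8")  # hypothetical UTF-8 file with a BOM

# Drop the 3-byte BOM before handing the data to the editor buffer.
if on_disk.startswith(codecs.BOM_UTF8):
    payload = on_disk[len(codecs.BOM_UTF8):]
else:
    payload = on_disk

print(len(on_disk))  # 15: size of the file on disk
print(len(payload))  # 12: what a buffer-based length would report
```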

Right meow it is counting all the invisible chars like emoji style modificators

Yes, currently it is counting in file-representation bytes, so that is expected.
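As an illustration of why modifiers show up in the count, consider a thumbs-up emoji followed by a skin-tone modifier (a hypothetical sample, shown in Python): it renders as one symbol but consists of two code points and eight UTF-8 bytes.

```python
emoji = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4

print(len(emoji.encode("utf-8")))  # 8: bytes in the UTF-8 buffer
print(len(emoji))                  # 2: code points, although it is displayed as one symbol
```

So even a switch from bytes to code points would not make the count match what the user sees for such sequences.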

Akelpad counts EOL as two

Ok. I'm inclined to that too. After all, they really are two valid (although not directly visible) chars even in UTF-8.

xomx avatar May 26 '25 11:05 xomx

@xomx IMO CRLF should be counted as two.

alankilborn avatar May 26 '25 12:05 alankilborn