notepad-plus-plus Alphabetical text sorting does not work very well, especially for non-English text (but not only)

Description of the Issue

The sort function is expected to work correctly for languages other than English, too. Unfortunately, it does not.

Steps to Reproduce the Issue

Open a new text file and type the following, one word per line:

laka
łąka
mąka
mika

(Note the EOL, i.e. the line ending, after the last word.) These are Polish words, and the lines are already sorted in the correct alphabetical order. Nevertheless, try to sort them in Notepad++.

Expected Behavior

The function “Sort” should not change anything in this text

Actual Behavior

The final EOL is moved to the beginning of the file, and the words are no longer correctly sorted (the correct alphabetical order is totally ignored). You receive:

laka
mika
mąka
łąka

(note the empty line at the beginning; there is no EOL after the word “łąka”)

The bug was observed in Notepad++ v8.5.

BTW, the correct alphabetical order in Polish is aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż. Each letter is treated separately, not as a variant with a diacritic.

Mar 31 '23 12:03 grzegorj

There are TWO distinct issues here.
It is too bad that you combined them, as one could easily be fixed, but not the other. Combining them probably means nothing gets fixed.

Mar 31 '23 12:03 alankilborn

Indeed, there are two issues (but within the same scope). For me, ignoring the correct alphabetical order is much more important than the issue with the EOL (which I noticed in passing). Either Notepad++ is “totally English” (and then it can sort only according to ASCII codes), or it is international. But if the latter is the case, the inability to sort text lines correctly is a very serious deficiency of the program. Of the programs I know, only MS Word can do it correctly, using the system locale (which is present under Windows) instead of low-level sorting based on ASCII codes. It would be REALLY nice if Notepad++ also had a similar option. Then it would no longer be as English-biased as it is now, despite the existing superficial localizations (of the menu etc.). Really. The thing is worth fixing. Even if it may not be very simple.

Mar 31 '23 22:03 grzegorj

Of the programs I know, only MS Word can do it correctly, using the system locale (which is present under Windows) instead of low-level sorting based on ASCII codes. It would be REALLY nice if also Notepad++ had a similar option.

Apples to oranges.

Notepad++ is a source code editor. Source code is almost always (for historical reasons) a plain text file of ASCII keywords and punctuation symbols.

MS Word is a business document editor. Businesses have offices around the world, so naturally their software can intelligently edit internationalized text.

Apr 01 '23 05:04 rdipardo

Notepad++ is a source code editor. Source code is almost always (for historical reasons) a plain text file of ASCII keywords and punctuation symbols.

Not exactly.

Notepad++ is a free (as in “free speech” and also as in “free beer”) source code editor and Notepad replacement that supports several languages. (https://notepad-plus-plus.org/)

As you can read, Notepad++ is not only a source code editor. It is also a Notepad replacement. And it supposedly supports several languages. That is not true: in its present shape, it cannot sort text in several languages. So, there IS a problem with sorting, contrary to what has been written.

Besides, Notepad++ can handle Unicode text. Programmers do not need Unicode to write code! So, the statement that Notepad++ is only a source code editor is completely groundless. So, rdipardo, don’t you think that linking the Wikipedia article on the historical reasons for using English in source code had no relation to the subject and was just rude on your part?

Perhaps you use Notepad++ only for writing code. I do not, and plenty of other users do not either. I also use it for my non-business notes, which is fully consistent with the program’s declared purpose. Notepad is neither a business notes editor nor a source code editor.

Moreover, the function in question is called “Sort”, and not “Sort the code lines”! To tell the truth, I cannot imagine the need to sort code. But I do see a need to sort a text file that is not code. And such text needn’t be in English. So, the lack of correct sorting is a serious bug in the program, and it needs to be fixed.

In other words: source code does not need to be sorted. If Notepad++ were only a source code editor, it would not need the sort function at all. But since it has the function, it really should sort text correctly, and not only according to the English alphabet.

I am not a programmer and I cannot help with it, but I do not think it is a really hard thing to implement correct sorting of text, so I cannot understand this negative argumentation. It is enough to forget the ASCII codes of the characters and to use different codes for them instead. For a Polish text: assign “0” to space, “1” to “a” (and “A”, and also “á”, “Á”, “à”, “À”, “ä”, “Ä”, “å”, “Å” etc., as foreign letters “a” with diacritics should also be sorted in Polish text as if they were plain “a”s), “2” to “ą” (and to “Ą”), “3” to “b” etc. And then sort the text according to these codes, and not according to ASCII codes. Make a similar table of codes for each language Notepad++ supports. That’s all. I do not think it would be really hard. The sorting procedure itself would be exactly the same as it is now, except that these letter codes would be compared instead of ASCII codes.
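The table-based idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`makePolishRanks`, `rankOf`, `sortPolish`) are hypothetical, the table covers only the Polish alphabet plus the few foreign variants of “a” mentioned above, and any character missing from the table simply sorts after all known letters. It is not how Notepad++ sorts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: one rank per letter, shared by lower and upper case.
std::unordered_map<char32_t, int> makePolishRanks()
{
    const std::u32string lower = U"aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż";
    const std::u32string upper = U"AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ";
    assert(lower.size() == upper.size());

    std::unordered_map<char32_t, int> ranks;
    ranks[U' '] = 0; // space sorts before every letter
    for (std::size_t i = 0; i < lower.size(); ++i)
    {
        ranks[lower[i]] = static_cast<int>(i) + 1;
        ranks[upper[i]] = static_cast<int>(i) + 1;
    }
    // Foreign variants of "a" are ranked like plain "a".
    for (char32_t c : std::u32string(U"áÁàÀäÄåÅ"))
    {
        ranks[c] = ranks[U'a'];
    }
    return ranks;
}

// Characters missing from the table sort after all known letters (an assumption).
int rankOf(const std::unordered_map<char32_t, int>& ranks, char32_t c)
{
    const auto it = ranks.find(c);
    return it != ranks.end() ? it->second : 1000 + static_cast<int>(c);
}

// Sorts lines by comparing letter ranks instead of code points.
std::vector<std::u32string> sortPolish(std::vector<std::u32string> lines)
{
    const auto ranks = makePolishRanks();
    std::stable_sort(lines.begin(), lines.end(),
        [&ranks](const std::u32string& a, const std::u32string& b)
        {
            return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
                [&ranks](char32_t x, char32_t y) { return rankOf(ranks, x) < rankOf(ranks, y); });
        });
    return lines;
}
```

With this table, the four words from the original report come out as laka, łąka, mąka, mika. A single table like this cannot express tie-breaking rules (case, or accents that only matter when everything else is equal), which is one reason real collation is harder than it looks.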

Once again, if rdipardo uses Notepad++ as a source code editor, he does not need to use sorting at all. But I use it for different purposes, and I need sorting. Since the program is declared to be a Notepad replacement and not only a source code editor, please stop discussing the motivation, and just fix the bug in the program.

Thank you in advance.

Apr 02 '23 07:04 grzegorj

One more note on what rdipardo wrote.

As I have noticed, Notepad++ does not use a two-phase sorting procedure. In such a procedure, capital and lowercase letters are first treated as the same, and then, as a tie-breaker, capital letters are given priority over the corresponding lowercase letters.

Instead, for example, the text:

Basia
asia
basia
Asia

is sorted to:

Asia
Basia
asia
basia

(with the aforementioned moving of the EOL to the beginning), which is totally incorrect even for code writers!

The expected result of sorting is:

Asia
asia
Basia
basia

(with the option “capitals first”), or:

asia
Asia
basia
Basia

(with the option “lowercase first”).

So, Notepad++ cannot sort correctly even English text.
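The two-phase comparison described above could look roughly like this. It is a minimal sketch that assumes ASCII-only text; `twoPhaseLess` and `sortTwoPhase` are hypothetical names, not Notepad++ functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only case folding (an assumption: no locale handling here).
char foldAscii(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

bool twoPhaseLess(const std::string& a, const std::string& b, bool capitalsFirst)
{
    // Phase 1: compare while ignoring case.
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        const char ca = foldAscii(a[i]);
        const char cb = foldAscii(b[i]);
        if (ca != cb)
        {
            return ca < cb;
        }
    }
    if (a.size() != b.size())
    {
        return a.size() < b.size();
    }
    // Phase 2: the strings differ only in case; the first differing position decides.
    for (std::size_t i = 0; i < n; ++i)
    {
        if (a[i] != b[i])
        {
            assert(foldAscii(a[i]) == foldAscii(b[i]));
            const bool aIsUpper = std::isupper(static_cast<unsigned char>(a[i])) != 0;
            return capitalsFirst ? aIsUpper : !aIsUpper;
        }
    }
    return false; // identical strings
}

std::vector<std::string> sortTwoPhase(std::vector<std::string> lines, bool capitalsFirst)
{
    std::stable_sort(lines.begin(), lines.end(),
        [capitalsFirst](const std::string& a, const std::string& b)
        {
            return twoPhaseLess(a, b, capitalsFirst);
        });
    return lines;
}
```

Called with `capitalsFirst = true` on the four lines above, this yields Asia, asia, Basia, basia; with `false` it yields asia, Asia, basia, Basia.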

BTW, the correct way of sorting should take letters with diacritics into account. E.g. German umlauted letters, unlike Polish letters such as “ą” or “ł”, must be treated as plain letters when sorting. But when two words differ only in an umlaut, the word with the plain letter must come first. E.g. the correct alphabetical order is: Ohr, Öhr, Ohrenarzt, ohrenbetäubend, Ohrenbläser.

In Polish, “c” and “ć” are different, but “c” and “č” are the same (“č” is not a letter of the Polish alphabet). So, the correct order is: cap, capek, Čapek, car, czysty, ćma. Notice that “c” and “č” have the same place in the alphabet sorting, but when “háček” is the only difference, the word with the accented letter must follow the word with the plain letter.
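The rule described here (same place in the alphabet, with the accent used only as a tie-breaker) resembles the multi-level comparison used by general collation schemes. Below is a small sketch under loud assumptions: the weight table is hypothetical, hand-written, and covers only the letters of the Polish example above; `weightOf`, `levelKey` and `sortMultiLevel` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <string>
#include <unordered_map>
#include <vector>

// primary: position in the alphabet; secondary: accent; tertiary: case.
struct Weight { int primary; int secondary; int tertiary; };

// Hypothetical table, limited to the letters used in the example.
Weight weightOf(char32_t c)
{
    static const std::unordered_map<char32_t, Weight> table = {
        {U'a', {1, 0, 0}},  {U'c', {4, 0, 0}},  {U'č', {4, 1, 0}},  {U'Č', {4, 1, 1}},
        {U'ć', {5, 0, 0}},  {U'e', {7, 0, 0}},  {U'k', {14, 0, 0}}, {U'm', {17, 0, 0}},
        {U'p', {22, 0, 0}}, {U'r', {24, 0, 0}}, {U's', {25, 0, 0}}, {U't', {27, 0, 0}},
        {U'y', {32, 0, 0}}, {U'z', {33, 0, 0}},
    };
    const auto it = table.find(c);
    // Unknown characters sort after all known letters (an assumption).
    return it != table.end() ? it->second : Weight{1000 + static_cast<int>(c), 0, 0};
}

// Extracts one level of weights for a whole word.
std::vector<int> levelKey(const std::u32string& text, int Weight::*level)
{
    std::vector<int> key;
    key.reserve(text.size());
    for (char32_t c : text)
    {
        key.push_back(weightOf(c).*level);
    }
    return key;
}

// Compares all primary weights first; later levels only break ties.
bool multiLevelLess(const std::u32string& a, const std::u32string& b)
{
    for (int Weight::*level : {&Weight::primary, &Weight::secondary, &Weight::tertiary})
    {
        const std::vector<int> ka = levelKey(a, level);
        const std::vector<int> kb = levelKey(b, level);
        if (ka != kb)
        {
            return ka < kb;
        }
    }
    return false;
}

std::vector<std::u32string> sortMultiLevel(std::vector<std::u32string> words)
{
    std::stable_sort(words.begin(), words.end(), multiLevelLess);
    return words;
}
```

Here “c” and “č” share a primary weight, so “capek” and “Čapek” tie on the first level and the accent decides, while “ć” has its own primary weight and so “ćma” sorts after “czysty”.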

Sorting is not a simple procedure (but should really be simple to implement), and treating sorting of text lines as sorting of ASCII codes is a serious bug. Even for programmers...

Apr 02 '23 09:04 grzegorj

I have just found a program that is declared to be purely a programmer’s editor, PsPad (http://www.pspad.com/en/). Although it is purely a source code editor and not a text editor, it can sort text fully correctly, and even better than MS Word does.

Anyway, comparisons of oranges to apples are just pure fantasies of an arrogant person. We should collaborate, not admonish others, pretend to be smart (while being nothing more than rude), or send others off to Wikipedia articles to learn one thing or another. PsPad is not a “business document editor” and can sort text files correctly. So, Notepad++ has a serious bug, and lacks correct sorting.

Dear developers, please correct this bug. Take an example from the other source code editor!

Apr 02 '23 15:04 grzegorj

@grzegorj , in spite of your obnoxious lecturing of programmers about what "should really be simple to implement" despite having not the faintest idea of how little relationship there is between the outside-view perceived difficulty of programming something and the actual difficulty of implementing it, I happen to be interested in trying to work on this problem, or at least a more tractable subset of it.

Newer versions of Notepad++ have Line Operations->Sort lines lexicographically [asc/desc]ending ignoring case, which addresses one of the issues you mentioned in your last post. 8.3.3, which is a year old, is the oldest version I've checked that has it.

I see what you're saying about the transposition of empty lines when you sort the file. Probably the easiest solution would simply be to remove all empty lines from the region to be sorted before sorting.

It also looks like adding a function similar to ToUpperInvariant, but using LOCALE_USER_DEFAULT instead of LOCALE_INVARIANT, would help.

So my new function would be

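// Upper-cases a single character using the current user's locale rules.
// Falls back to returning the input unchanged if the mapping fails.
static TCHAR ToUpperCultureSensitive(TCHAR input)
{
	TCHAR result;
	// LCMapString returns the number of characters written, or 0 on failure.
	const int written = LCMapString(LOCALE_USER_DEFAULT, LCMAP_UPPERCASE, &input, 1, &result, 1);
	if (written == 0)
	{
		assert(false && "LCMapString failed to convert a character to upper case culture-sensitively");
		result = input;
	}
	return result;
}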

Apr 06 '23 00:04 molsonkiko

Any help trying to make my new sorter class culture-sensitive would be appreciated. I'm pretty sure it doesn't work yet though.

Apr 06 '23 06:04 molsonkiko

(note the empty line at the beginning; there is no EOL after the word “łąka”)

I see what you're saying about the transposition of empty lines when you sort the file.

@grzegorj 's complaint regarding something that is empty is that sorting (lex. ascending) this file:

incorrectly results in this:

How is that in any way correct? But, stunningly, the author of Notepad++ judged it so. :-(

Further, and also stunning, if one does Ctrl+a before the sort, to select all text and every line in the file:

Then the sort result IS correct:

Probably the easiest solution would simply be to remove all empty lines

If one does Remove Empty Lines, pre-sort:

And then sorts, a reasonable result is obtained:

But, as I don't consider the last "line" as a true line unless it has a line-ending on it (and I enforce this, for all of my files, with the editorconfig plugin), I can't obtain the sort result I want without doing a lot of "pre-thinking" about how to obtain correct results.

THIS is the part that I wish @grzegorj had broken out into a completely separate issue when I said:

There are TWO distinct issues here.

Combining it with a language-specific sorting order problem completely loses this detail.

Apr 06 '23 11:04 alankilborn

Good point.

It looks like there's some logic for removing empty lines in NumericSorter in Sorters.h that could be changed slightly, moved to a separate function, and called before sorting in each of the ISorter implementations in that file.

Apr 06 '23 17:04 molsonkiko

It looks like there's some logic for removing empty lines in NumericSorter ...

No empty lines need to be removed. More care just needs to be taken to leave out that empty non-line at the end of the file, rather than letting it get dragged into the sort (and thus "sorted" to the top of the results).

Apr 06 '23 17:04 alankilborn

Maybe changing this function to the below would work? It did for me locally: now, if there's an empty line at EOF, it's not moved. The current version of this function specifically chooses not to remove that last empty line when sorting the entire document. So basically, it goes out of its way to behave in a way you didn't like.

void ScintillaEditView::sortLines(size_t fromLine, size_t toLine, ISorter* pSort)
{
	if (fromLine >= toLine)
	{
		return;
	}

	const auto startPos = execute(SCI_POSITIONFROMLINE, fromLine);
	// SCI_LINELENGTH includes the line's EOL characters, so endPos lies past the EOL of toLine.
	const auto endPos = execute(SCI_POSITIONFROMLINE, toLine) + execute(SCI_LINELENGTH, toLine);
	const generic_string text = getGenericTextAsString(startPos, endPos);
	std::vector<generic_string> splitText = stringSplit(text, getEOLString());
	// A trailing empty piece is not a real line: set it aside so it is never sorted to the top.
	const bool lastLineEmpty = splitText.rbegin()->empty();
	if (lastLineEmpty)
	{
		splitText.pop_back();
	}
	// Every line in the range, or one fewer when the range ended on the empty last line of the document.
	assert(splitText.size() == toLine - fromLine + 1 || (lastLineEmpty && splitText.size() == toLine - fromLine));
	const std::vector<generic_string> sortedText = pSort->sort(splitText);
	generic_string joined = stringJoin(sortedText, getEOLString());
	if (lastLineEmpty)
	{
		// Restore the EOL that preceded the piece removed above.
		assert(joined.length() + getEOLString().length() == text.length());
		joined += getEOLString();
	}
	if (text != joined)
	{
		replaceTarget(joined.c_str(), startPos, endPos);
	}
}
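For anyone who wants to try the idea outside Notepad++, here is a standalone sketch of the same approach: split on the EOL string, set aside a trailing empty piece, sort, then put the final EOL back. `splitLines` is a hypothetical stand-in for `stringSplit`, the sort is a plain `std::sort`, and nothing here touches Scintilla.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a split helper: always returns at least one piece.
std::vector<std::string> splitLines(const std::string& text, const std::string& eol)
{
    assert(!eol.empty());
    std::vector<std::string> pieces;
    std::size_t start = 0;
    for (std::size_t end = text.find(eol); end != std::string::npos; end = text.find(eol, start))
    {
        pieces.push_back(text.substr(start, end - start));
        start = end + eol.size();
    }
    pieces.push_back(text.substr(start));
    return pieces;
}

// Sorts the lines of text while keeping a final EOL (if any) at the end.
std::string sortKeepingFinalEol(const std::string& text, const std::string& eol)
{
    std::vector<std::string> lines = splitLines(text, eol);
    // The empty piece after a final EOL is not a line of its own.
    const bool endsWithEol = lines.size() > 1 && lines.back().empty();
    if (endsWithEol)
    {
        lines.pop_back();
    }
    std::sort(lines.begin(), lines.end());

    std::string joined;
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        if (i != 0)
        {
            joined += eol;
        }
        joined += lines[i];
    }
    if (endsWithEol)
    {
        joined += eol;
    }
    return joined;
}
```

Empty lines in the middle of the text are still treated as real lines and sort to the top; only the piece after the final EOL is set aside.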

Apr 06 '23 18:04 molsonkiko

@grzegorj

A new plugin called Columns++ offers a sorting option; you may want to try it to see if it sorts your text differently/better.

See:

https://community.notepad-plus-plus.org/topic/24353/new-plugin-columns
https://github.com/Coises/ColumnsPlusPlus

Apr 09 '23 15:04 alankilborn

Second alankilborn's suggestion. Caveat: (at the time of this writing) you need to select the region you want to sort first, or the command does nothing. Before:

bass
baßk 
bLue
blue
blüe
blve
oyster
öyster
spä
spb
Spb

after Columns++ -> sort descending (locale):

spb
Spb
spä
öyster
oyster
blve
blüe
bLue
blue
baßk 
bass

Given that this functionality now exists in a plugin, do people think this request is too niche to bother including as a default feature, or should it be added anyway (probably stealing code from the plugin)?

Apr 09 '23 20:04 molsonkiko

Caveat: (at the time of this writing) you need to select the region you want to sort first, or the command does nothing.

I don't think that is a caveat ... the plugin is built around the concept of acting on a selection, specifically a column-selection.

do people think this request is too niche to bother including as a default feature, or should it be added anyway

I think it would be best as a core N++ feature.

Apr 09 '23 23:04 alankilborn

I guess the other question is: does it make more sense to simply replace the existing lexicographic sort with a locale-sensitive sort, or add a new locale-sensitive sort on top of the current lexicographic one?

Apr 10 '23 14:04 molsonkiko

does it make more sense to simply replace the existing lexicographic sort with a locale-sensitive sort, or add a new locale-sensitive sort

I'm sure if the new one replaces the old one, someone will be impacted that they can't do the same sort they used to. I'd suggest adding, not replacing.

Apr 10 '23 23:04 alankilborn

Just a note, for what it might be worth:

When I added sort functions to Columns++, my purpose was to deal with the fact that Notepad++ sort with a rectangular selection doesn't work when there are tabs in the file. I thought of suggesting a change in Notepad++ and offering a pull request, but I realized the entire sorting strategy used by Notepad++ would have to change. My impression is that the existing sort is meant to be reasonably efficient even for very large files. That was not a priority for me, while working with tabs was.

I came up with the "locale" sort when I started to look into exactly what I'd need to do to make a "case insensitive" sort. Outside the ASCII range, that concept isn't well-defined without specifying a locale; knowing the code page isn't good enough. That led me to the code I use here. (At present, options is always LCMAP_SORTKEY | NORM_LINGUISTIC_CASING | LINGUISTIC_IGNORECASE | SORT_DIGITSASNUMBERS.) This method fit sensibly with what I was already doing (storing sort keys as members of objects in a vector, with the full line data stored elsewhere), but it wouldn’t mesh with the method Notepad++ uses.
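The "sort keys stored as members of objects in a vector, with the full line data stored elsewhere" arrangement can be sketched portably like this. Note the assumptions: `makeSortKey` is a hypothetical stand-in that merely lower-cases ASCII, whereas on Windows the key would presumably come from `LCMapString` with `LCMAP_SORTKEY` and the options listed above; this is not the Columns++ code.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a locale-aware key function (ASCII lower-casing only).
std::string makeSortKey(const std::string& line)
{
    std::string key = line;
    for (char& c : key)
    {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return key;
}

// The key is computed once per line; the line itself stays in the original vector.
struct KeyedLine
{
    std::string key;
    std::size_t index;
};

std::vector<std::string> sortBySortKey(const std::vector<std::string>& lines)
{
    std::vector<KeyedLine> keyed;
    keyed.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        keyed.push_back({makeSortKey(lines[i]), i});
    }
    // Only the precomputed keys are compared; equal keys keep their original order.
    std::stable_sort(keyed.begin(), keyed.end(),
        [](const KeyedLine& a, const KeyedLine& b) { return a.key < b.key; });

    std::vector<std::string> sorted;
    sorted.reserve(lines.size());
    for (const KeyedLine& k : keyed)
    {
        sorted.push_back(lines[k.index]);
    }
    assert(sorted.size() == lines.size());
    return sorted;
}
```

Computing each key once avoids repeating the (potentially expensive) mapping on every comparison, at the cost of holding a key per line in memory.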

I plan at some point to add a user option to select a locale other than the user default, and to change some of the other options associated with the derivation of the sort key. Right now I think the sort I use is the same as Windows uses to sort filenames in Windows Explorer.

One other thing... no sort based on Windows locale sorting is anything like what people think of as a "case sensitive" sort. When you don't specify LINGUISTIC_IGNORECASE (or NORM_IGNORECASE), the sort still uses the relevant alphabetical order as the primary sort; only when every letter and symbol are equal except for case (as defined in that locale) does the sort distinguish case. The familiar ASCII order with all the capitals ahead of all the lower case does not exist in a locale-sensitive context.

Apr 19 '23 19:04 Coises

Thanks, @Coises ! I was going to add lexicographic ignore-case culture-sensitive sorting as an option in addition to the existing case-insensitive sorting, and probably just use your code as a jumping-off point. Unfortunately I have another PR already open for the other issue the OP raised (namely, the empty line at EOF being sorted to the top of the file), and I can't start working on implementing this new sort until something is done on my other PR.

Apr 19 '23 19:04 molsonkiko

I can't start working on implementing this new sort until something is done on my other PR

Why's that?

Apr 19 '23 20:04 alankilborn

does it make more sense to simply replace the existing lexicographic sort with a locale-sensitive sort, or add a new locale-sensitive sort

I'm sure if the new one replaces the old one, someone will be impacted that they can't do the same sort they used to. I'd suggest adding, not replacing.

My personal feeling is that a case-sensitive locale sort is useless, and a case-insensitive sort that isn’t locale-aware is incoherent (though if you only use ASCII, you’d never notice). As a practical matter, there are already a lot of sorts on that menu... so while I’d lean towards replacing just the “Ignoring Case” sorts with locale-aware sorts, picking the right terminology so users don’t get confused could be challenging.

As I mentioned, though, there’s also the problem that the sorting strategy used by Notepad++ is not readily adaptable to locale-based sorting (nor to rectangular selections when there are tabs in the file).

Apr 19 '23 20:04 Coises

I can't start working on implementing this new sort until something is done on my other PR

Why's that?

Because I was being silly and forgot that I could just create a new branch on my fork that was independent of the branch I was using for my other PR. 😝

I've gotten back to work on the culture sensitive locale sort, and results so far are encouraging.

@Coises , when I was studying your code in Columns++'s Sort.cpp, I couldn't help but notice that you use SORT_DIGITSASNUMBERS, which I looked up in the relevant documentation. When I tried to use it in my fork of notepad_plus_plus, though, I couldn't use it. I'm on Windows 10, and #include <WinNls.h> didn't make it visible. What do you think might be going on?

Apr 23 '23 04:04 molsonkiko

@molsonkiko Just a guess; from WinNls.h:

//  Sort digits as numbers (ie: 2 comes before 10)
#if (WINVER >= _WIN32_WINNT_WIN7)
#define SORT_DIGITSASNUMBERS      0x00000008  // use digits as numbers sort method
#endif // (WINVER >= _WIN32_WINNT_WIN7)

Is it possible that you're compiling for a Windows version lower than that?

Apr 23 '23 05:04 Coises

Is it possible that you're compiling for a Windows version lower than that?

In retrospect, yes, obviously. Notepad++ would definitely be compiling for Windows 7 because it's been around for so long.

Apr 23 '23 05:04 molsonkiko

@molsonkiko Hmmm... I see: <WindowsTargetPlatformVersion>10.0</WindowsTargetPlatformVersion> in Notepad++ and in your fork. Is it possible you have something different on your development machine? (It is also possible that I don't know what I'm looking at...)

Apr 23 '23 06:04 Coises

I see: <WindowsTargetPlatformVersion>10.0</WindowsTargetPlatformVersion> in Notepad++ and in your fork. Is it possible you have something different on your development machine? (It is also possible that I don't know what I'm looking at...)

Despite the name, it actually refers to the SDK version. Visual Studio's corresponding option is more explicit:

It determines the compiler's header search path and the linker's library path.

You could think of it as the maximum supported Windows version, since every new feature goes into the SDK before it's released in the next build of the OS.

Only developers of bleeding-edge apps need to care about it. Notepad++ runs almost entirely on core system libraries that have barely changed in 35 years.

Apr 23 '23 10:04 rdipardo

@rdipardo , thanks for explaining how that works.

I'm not 100% sure what that means about the possibility of including SORT_DIGITSASNUMBERS, though. I assume based on this discussion that we would have to change the Notepad++ core vcxproj file.

Apr 24 '23 00:04 molsonkiko

@molsonkiko Check the setting @rdipardo illustrated in the development environment where you couldn't compile using SORT_DIGITSASNUMBERS.

If that doesn't explain it, follow LINGUISTIC_IGNORECASE to WinNls.h (right-click and choose Go To Definition) and scroll until you see a preprocessor statement involving WINVER. Hover over WINVER and see what the tooltip says is its value. It must be at least 0x0601 for SORT_DIGITSASNUMBERS to work. If it's not, and the Windows SDK Version setting doesn't explain it... then there's something to figure out, I guess.

Apr 24 '23 00:04 Coises

Well, I tried playing with those settings, and it didn't help.

So just on a stupid whim, I just inserted #define SORT_DIGITSASNUMBERS 0x00000008 in Common.h (copied from WinNls.h per Coises' suggestion), and wouldn't you know, it worked.

Now Sort Lines Lex. Asc. Culture-sensitively ignoring case sorts

11
2

to

2
11

Hooray! 🎉
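To illustrate what "digits as numbers" means for the comparison, here is a portable sketch of a comparator that compares runs of digits by numeric value. This is an assumption-laden illustration with a hypothetical name (`naturalLess`), ASCII digits only, and no locale awareness; it is not how Windows implements SORT_DIGITSASNUMBERS.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

bool isAsciiDigit(char c)
{
    return c >= '0' && c <= '9';
}

// Compares two strings, treating each run of digits as one number.
bool naturalLess(const std::string& a, const std::string& b)
{
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < a.size() && j < b.size())
    {
        if (isAsciiDigit(a[i]) && isAsciiDigit(b[j]))
        {
            // Leading zeros do not change the numeric value.
            while (i < a.size() && a[i] == '0') ++i;
            while (j < b.size() && b[j] == '0') ++j;
            std::size_t endA = i;
            std::size_t endB = j;
            while (endA < a.size() && isAsciiDigit(a[endA])) ++endA;
            while (endB < b.size() && isAsciiDigit(b[endB])) ++endB;
            const std::size_t lenA = endA - i;
            const std::size_t lenB = endB - j;
            // More significant digits means a larger number.
            if (lenA != lenB)
            {
                return lenA < lenB;
            }
            const int cmp = a.compare(i, lenA, b, j, lenB);
            if (cmp != 0)
            {
                return cmp < 0;
            }
            i = endA;
            j = endB;
        }
        else
        {
            if (a[i] != b[j])
            {
                return static_cast<unsigned char>(a[i]) < static_cast<unsigned char>(b[j]);
            }
            ++i;
            ++j;
        }
    }
    // Whichever string has characters left over is the larger one.
    return (a.size() - i) < (b.size() - j);
}

std::vector<std::string> sortNatural(std::vector<std::string> lines)
{
    std::stable_sort(lines.begin(), lines.end(), naturalLess);
    return lines;
}
```

With this comparator the two lines above sort as 2, 11, whereas a plain character-by-character comparison puts 11 first because '1' precedes '2'.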

Apr 24 '23 02:04 molsonkiko

from WinNls.h:

//  Sort digits as numbers (ie: 2 comes before 10)
#if (WINVER >= _WIN32_WINNT_WIN7)
#define SORT_DIGITSASNUMBERS      0x00000008  // use digits as numbers sort method
#endif // (WINVER >= _WIN32_WINNT_WIN7)

Is it possible that you're compiling for a Windows version lower than that?

As it turns out, yes: see how the property sheet defines _WIN32_WINNT:

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/c76f178534bba518f263c8120b732801e13c4916/PowerEditor/visual.net/notepadPlus.Cpp.props#L28

_WIN32_WINNT_VISTA is less than _WIN32_WINNT_WIN7, causing #if (WINVER >= _WIN32_WINNT_WIN7) to evaluate to false, so SORT_DIGITSASNUMBERS is never defined.

Swapping in _WIN32_WINNT_WIN7 lets @molsonkiko's fork compile without any kludge. Adding a #define that might someday be duplicated is just an accident waiting to happen.
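The effect of the version gate can be reproduced in isolation. In the sketch below every `DEMO_` macro is a hypothetical stand-in so that the snippet compiles without the Windows SDK; the numeric values mirror the SDK's 0x0600 (Vista) and 0x0601 (Windows 7), and the target defaults to Vista to imitate the property sheet linked above.

```cpp
#include <cassert>

// Stand-ins for the SDK version constants (hypothetical DEMO_ names).
#define DEMO_WIN32_WINNT_VISTA 0x0600
#define DEMO_WIN32_WINNT_WIN7  0x0601

// Imitates a project that targets Vista unless told otherwise.
#ifndef DEMO_WINVER
#define DEMO_WINVER DEMO_WIN32_WINNT_VISTA
#endif

// Same shape as the guard quoted from WinNls.h.
#if (DEMO_WINVER >= DEMO_WIN32_WINNT_WIN7)
#define DEMO_SORT_DIGITSASNUMBERS 0x00000008
#endif

// Reports whether the gated constant survived preprocessing.
constexpr bool digitsAsNumbersAvailable()
{
#ifdef DEMO_SORT_DIGITSASNUMBERS
    return true;
#else
    return false;
#endif
}

static_assert(DEMO_WINVER < DEMO_WIN32_WINNT_WIN7 || digitsAsNumbersAvailable(),
              "a Win7-or-later target must expose the constant");
```

Compiled as is, `digitsAsNumbersAvailable()` is false; building with `-DDEMO_WINVER=0x0601` makes it true, which corresponds to raising the target version rather than re-defining the constant by hand.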

Since N++ dropped Vista in 8.4.7 (or rather, Microsoft did), it's a good time to ask if _WIN32_WINNT can target Win7 instead.

Apr 24 '23 06:04 rdipardo