DSpellCheck mult lang spell don't work

Multiple languages spell check stop work after v8.1.5 (probably) Now I have DSpellCheck 1.4.21.0 - 64bit Notepad++ v8.1.7 - 64bit and spell check works only for 1 language...

multi

Oct 17 '21 13:10 brolny

Unfortunately cannot be reproduced from my side, maybe it is some problem with settings so multiple languages are no longer chosen, what it shows if you press Set Multiple Languages...?

Nov 06 '21 16:11 Predelnik

@Predelnik

I'm also affected by the problem (or a very similar problem) since a couple of months. But only recently found a way to reliably reproduce it.

Notepad++ 8.4.3 (64-bit version) DSpellCheck-1.4.24.0 DSpellCheck is configured to use Hunspell with the "Multiple Languages"-option.

Test case

correct правильный, incarrect непровильный

[working configuration] "Multiple Languages" is set to "English (United Stated), Russian (Russia)": Everything works as expected, i.e. "incarrect" and "непровильный" are marked as misspelled.
[failing configuration] "Multiple Languages" is set to "English (United Stated), Russian (Russia), and Ukrainian (Ukraine)": DSpellCheck fails to detect all misspelled words, only "непровильный" is marked as misspelled. "incarrect" and any other arbitrary sequence of Latin alphabet letters are considered to be written correctly.

NB: A very similar problem has recently been fixed in Firefox: Firefox bug report, Firefox fix. Perhaps taking a look at the Firefox fix might help fixing the problem in DSpellCheck.

Jul 09 '22 20:07 er13

Thank you for the test case @er13, I can reproduce it and will certainly look at what is going on.

Jul 10 '22 19:07 Predelnik

So I understand the problem but I'm not sure whose responsibility it should be to fix that unfortunately. Also as far as I understand it seems to be different from firefox. The problem is that uk_UA.aff contains lines ICONV a 0 and so on for each Latin character. What it means is before we check any word against the dictionary all the Latin characters are changed to 0 (not \0 byte but literal 0) then Hunspell internally thinks that such words are actually numbers and tells that they are correct. Since my logic for multiple dictionary usage is checking against each dictionary and looking if any of them is correct in this case word is treated as correct.

Jul 10 '22 20:07 Predelnik

Hmm, what about ignoring (i.e. not checking) the words containing 0 only characters (just like Firefox ignores \0-words). Still a workaround, but better than the wrong behavior.

On the other hand, I also don't understand all these iconv [a-zA-Z] 0 lines in the Ukrainian dictionary and would rather say, the proper solution would be to remove them from the Ukrainian dictionary as they cause any sequence of Latin letters to be treated as a correctly spelled word, also with only one dictionary - the Ukrainian one - enabled.

Edit: the iconv-lines exist since April 2021 https://cgit.freedesktop.org/libreoffice/dictionaries/commit/?id=06a28cf2efe2e3fa887912989650dcaaf05f0958 https://gerrit.libreoffice.org/c/dictionaries/+/114308 https://bugs.documentfoundation.org/show_bug.cgi?id=141408

Edit2: the sources of the Ukrainian dictionary https://github.com/brown-uk/dict_uk

Jul 10 '22 22:07 er13

Created an issue in the project developing the Ukrainian dictionary.

https://github.com/brown-uk/dict_uk/issues/306

Jul 11 '22 08:07 er13

I believe it's possible to have a workaround which ignores words which had something converted to 0 because luckily Hunspell exposes the function to perform this conversion externally, I will have to consider doing it.

Jul 11 '22 08:07 Predelnik

@Predelnik: the workaround you implemented in https://github.com/Predelnik/DSpellCheck/pull/302/commits/c31e59bdbcd610c587e866acf2f023b7ff982d42 is not good.

The typical error/typo in Ukrainian is the usage of Latin «i» instead of Ukrainian «і» (e.g. «гранiтний» instead of «гранітний»). With you workaround the Latin «i» would be replaced with 0, the number of zeros would get increased and the whole word would be excluded from the spellchecking. This is not good.

Please check as requested for the whole word after dictionary conversion contains zeros only instead of the number of zeros after dictionary conversion has increased. Thanks!

Jul 12 '22 09:07 er13

On one hand what you are saying makes sense but seems like fix technically still works because I only add logic to treat word with Latin characters as incorrect without changes for other cases. So for the word containing Latin i it will stay incorrect as it was with 0 replacement, just dictionary won't be consulted about it. Strictly speaking it doesn't matter which logic there will be, yours or mine, unless there is some word which is correct after changing its Latin characters to zero and I don't think there is such a word.

Jul 12 '22 10:07 Predelnik

Hmm, to be honest I don't see your

because I only add logic to treat word with Latin characters as incorrect without changes for other cases

neither in https://github.com/Predelnik/DSpellCheck/pull/302/commits/c31e59bdbcd610c587e866acf2f023b7ff982d42 nor in src/spellers/HunspellInterface.cpp. Could you please point me out where this check for Latin only words takes place?

Besides that a performance argument:

With you check you have to iterate completely over every original word (because you count the number of zeros in it), you also have to iterate over every dictionary converted word (also because you count the number of zeros) and then compare both numbers. => So assuming your text contains N words you have definitely 2*N iterations, whereas each iteration definitely iterates over the whole word.
With my way to check you have to iterate only over the dictionary converted words and could break the iteration for each word after the first non-zero character. => So assuming the same text with N words you have only at most N iterations, whereas each iteration is only then takes places over the whole word if it's a Latin-characters-word.

Jul 12 '22 10:07 er13

Could you please point me out where this check for Latin only words takes place?

By word with latin characters I mean word containing some Latin characters so no there is no "check for Latin only words"

Performance argument doesn't matter because

The largest cost is repeating conversion which Hunspell will do later anyway.
I cannot break anywhere during conversion because it's done internally in Hunspell and I don't want to change its code.

I don't think all these things are important to be honest, if it works for your example case which seems to be true I fail to see the problem.

Jul 12 '22 11:07 Predelnik

I mean

std::ranges::count(word_to_check, '0') is the complete iteration over the original word
std::ranges::count(buf, '0') is the complete iteration over the converted word

If you change the name of the function from has_more_zeroes_after_conv to contains_zeros_only_after_conv and change the line

return std::ranges::count(word_to_check, '0') < std::ranges::count(buf, '0');

to

for (auto & ch: buf) {
    if (ch != '0')
        return false;
}
return true;

then you would significantly increase the performance.

Off course the costly line dic.hunspell->input_conv(word_to_check, buf); is always (i.e. in both cases) necessary.

Jul 12 '22 11:07 er13

But it's pretty obvious that in current case I can also short-circuit if I encountered any 0 which was not 0 before, I just don't think it's that important. Also not checking original word completely is not entirely correct also because the word initially made out of all 0s would probably needs to be treated as correct. My point is that all of this is not really important, it's just a workaround.

Jul 12 '22 11:07 Predelnik

Ok it's up to you, I just provided my argumentative feedback after reviewing the code.

My point is "less optimal code there", "less optimal code somewhere else" - at some point in time you will notice a performance degradation. If you have a chance to do it in a more optimal way, then do it straight away.

Jul 12 '22 12:07 er13

@Predelnik It looks like the developer of the Ukrainian dictionary is willing to remove the iconv lines in one of the upcoming releases. So wait a bit with merging the workaround into the master branch. It might become unnecessary.

I'm fine with the workaround based on the manual edit of uk_UA.aff until then.

And thanks once again for analyzing the issue. I would however consider my case as unrelated to that of the original bug reporter. The original bug reporter doesn't use the Ukrainian dictionary at all.

Jul 13 '22 07:07 er13

Thank you for the information, @er13, I will probably abstain from merging the workaround for now. Yes, the original bug is probably something different but it's hard to tell without feedback from the author unfortunately.

Jul 13 '22 07:07 Predelnik

I think the Korean language has a similar problem, although I'm unsure whether it's the same culprit.

I'm a total newbie when it comes to dictionary files structure, but https://raw.githubusercontent.com/LibreOffice/dictionaries/master/ko_KR/ko_KR.dic contains these lines:

a/32
b/32
c/32
d/32
e/32
f/32
g/32
h/32
i/32
j/32
k/32
l/32
m/32
n/32
o/32
p/32
q/32
r/32
s/32
t/32
u/32
v/32
w/32
x/32
y/32
z/32

and after removing them, the multi language spell check seems to be working correctly.

Nov 25 '23 16:11 tomasz1986

DSpellCheck DSpellCheck copied to clipboard

mult lang spell don't work

DSpellCheck
DSpellCheck copied to clipboard