Not playing well with 'dictionary-it' (2)
Hi, thanks for the wonderful library. The Italian (and also Portuguese) dictionaries hang my app that uses nspell. Is there anything new on that front (there is a closed issue on the Italian dic)? Will those problems be fixed in the future? Is there a workaround? Thanks, i.
There’s no news. The Italian dictionary/affix combination results in a huge list of possibilities, and I don’t know Italian so I can’t check if changing the code would still properly spellcheck it!
If you know Portuguese or Italian, we could start off where #9 ended?
Thanks for your quick answer! Unfortunately not. All languages that I speak work very well. Then again, you can find some Italian text on the web and you can see if spelling works as intended ...
I also tried this package:
https://www.npmjs.com/package/typo-js
and it has similar problems with the same languages. Maybe a vanilla JS approach just won't cut it for those dictionaries :( Will investigate further ... i.
dictionary-pt-br has the same issue. :disappointed:
It leaks to an insane 1.4 GB of RAM at runtime.
It's worth checking Hunspell's official algorithm/implementation, as it never runs above 50 MB of RAM while checking against the same dict and processes lots of words very fast. I'm currently trying to understand how it works, but sadly can't promise I will come back with a solution soon.
It leaks to an insane 1.4 GB of RAM at runtime.
I’d appreciate a fix! PRs welcome!
It's worth checking Hunspell's official algorithm/implementation, as it never runs above 50 MB of RAM while checking against the same dict and processes lots of words very fast. I'm currently trying to understand how it works, but sadly can't promise I will come back with a solution soon.
We/I can’t, because that would taint this project as being a derivative work of GPL code. nspell is reverse-engineered.
We/I can’t, because that would taint this project as being a derivative work of GPL code. nspell is reverse-engineered.
Switching to GPL wouldn't be a problem for my usage, but I understand your concerns.
However, I found that MySpell (on which Hunspell is based) is licensed permissively and is probably the base for the algorithm we're comparing against. As I haven't looked into Hunspell's source yet, I will try checking out MySpell's first, in order to keep any fork/PR compatible with the MIT license (in case I get any useful insight).
I just did some research on what happens during the load of the Italian dictionary, and the algorithm just gives up at line 6720: appuntare/ALKhlTXqrI.
If I comment that line out, it stops at the next one, so I think that's where the limit is. I don't know whether or not that helps, but as the Italian dictionary is rather large, maybe it's indeed required that the algorithm applies some memory-saving options … I've seen that the ALKhlTXqrl flag is used quite often, and it seems long, so maybe this helps?
I'll continue to research where the algorithm is defeated, so maybe I'll come up with a solution at some point!
It should also be noted that hunspell-spellchecker has the same issue (excessive memory consumption with the use of the IT dictionary).
@leonidasv pt-br now uses a new source. Hope this one works.
I suggest trying the Italian dictionary with nodehun to see if it works natively. The Italian source hasn’t been updated in 5 years, so maybe it’s just bad.
I don’t know Italian so I can’t check if changing the code would still properly spellcheck it!
@wooorm Italian is my mother tongue, if there's anything that I can do to help get this issue resolved I'd be happy to do that.
Thanks Fabio, appreciate that!
Currently focussing on other stuff, but when I get ready for another push on nspell I’ll ping you!
I've taken a closer look at this; with the recent performance improvements I can get nspell to stop hanging after a good but manageable number of seconds, until eventually it throws this error for me:
RangeError: Value undefined out of range for undefined options property undefined
Which, according to this SO answer, means something ran out of available keys, and according to this V8 engineer's SO answer, Maps have a hard-coded maximum of 2^24 keys.
Basically, as I'm understanding it, the Italian dictionary is generating so many combinations that the engine becomes incapable of handling them.
This should be fixable by using a pool of Map objects internally rather than a single one. It would be useful to measure how many keys the Italian dictionary actually needs though; if instead of 16M it's 16B or something then the whole effort would be pointless.
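For reference, this is roughly the kind of thing I mean by a pool of Maps, just a sketch with made-up names, not actual nspell internals:

```ts
// Rough sketch of a sharded map: spread entries across several Maps so no
// single Map ever reaches V8's ~2^24 key cap.
class MapPool<V> {
  // 8 shards of ~16M keys each leaves headroom for well over 100M entries.
  private shards: Map<string, V>[] = Array.from({ length: 8 }, () => new Map<string, V>());

  private shard(key: string): Map<string, V> {
    // Cheap string hash to pick a shard deterministically.
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
    }
    return this.shards[hash % this.shards.length];
  }

  set(key: string, value: V): this {
    this.shard(key).set(key, value);
    return this;
  }

  get(key: string): V | undefined {
    return this.shard(key).get(key);
  }

  has(key: string): boolean {
    return this.shard(key).has(key);
  }

  get size(): number {
    return this.shards.reduce((total, map) => total + map.size, 0);
  }
}
```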
Alternative idea: the source of the dictionary hasn’t been updated in years. Do any of you Italian-speaking folk know if there are alternative groups making Italian dictionaries? Or is there a cap we can put on this expression to not generate a bazillion entries?
Thanks for digging into it, @fabiospampinato!
Or is there a cap we can put on this expression to not generate a bazillion entries?
I think yes. I mean, why don't we try to simply read in all the contents of the .dic file? We already have the flags behind each entry, so why do we need to resolve these flags immediately? Wouldn't lazy-loading them be easy? I am currently thinking of the following (please tell me if I forget something that prevents this):
- The files are loaded, but this "loading" means "Save them to an internal line array"
- Once a word is looked up (e.g. "giorno", to stick with the Italian example), we could take all possible candidates from the line array (we can be generous with this, because some words may have modifications as a prefix; the goal should only be to not parse all entries) and resolve them until we find a match. If the word is spelled incorrectly, we won't get a match, but the upper cap for this would be once all candidates are resolved.
- Then save those resolved words in an internal map (maybe even distinguished by first letters, so that we have, e.g., 26 maps from A to Z containing all resolved words)
This would increase the computational requirements a little bit at the beginning, but it might both reduce the insane load times we currently see even with English dictionaries and enable loading any DIC file.
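In code I'm imagining something like this rough sketch, where all names are made up and `expandLine` stands in for nspell's real affix expansion (which I haven't studied in detail):

```ts
// Rough sketch of the lazy resolution idea; every name here is hypothetical.
type ExpandLine = (stem: string, flags: string) => string[];

class LazyDictionary {
  // Raw .dic entries kept unparsed: [stem, flag string].
  private lines: [string, string][];
  // Resolved words cached per initial letter, so nothing is expanded twice.
  private resolved = new Map<string, Set<string>>();

  constructor(dic: string, private expandLine: ExpandLine) {
    this.lines = dic
      .split('\n')
      .slice(1) // the first line of a .dic file is just the entry count
      .filter(Boolean)
      .map((line) => {
        const [stem, flags = ''] = line.split('/');
        return [stem, flags] as [string, string];
      });
  }

  correct(word: string): boolean {
    const letter = word.charAt(0).toLowerCase();
    let bucket = this.resolved.get(letter);
    if (!bucket) {
      bucket = new Set();
      // Candidate selection is deliberately naive: resolve every stem sharing
      // the first letter. A real version would also need to consider stems
      // reachable through prefix rules, whose expansions start differently.
      for (const [stem, flags] of this.lines) {
        if (stem.charAt(0).toLowerCase() !== letter) continue;
        for (const form of this.expandLine(stem, flags)) bucket.add(form);
      }
      this.resolved.set(letter, bucket);
    }
    return bucket.has(word);
  }
}
```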
Any thoughts? What obvious thing did I overlook?
And thanks again for caring :)
Alternative idea: the source of the dictionary hasn’t been updated in years. Do any of you Italian-speaking folk know if there are alternative groups making Italian dictionaries?
I don't know anything about that.
Or is there a cap we can put on this expression to not generate a bazillion entries?
Doable, but ideally one wouldn't want the spell checker to understand 50% of a language and ignore the remaining 50% or something. That being said I think I got exactly 10 of those errors reported in my console, I don't know if the engine just gave up or if maybe we are only going like 0.00006% over the limit, which would be interesting.
In any case using multiple maps seems like a better option to me, as with a cap on the number of words we'll get nspell not recognizing some words or having a hard limit on the number of custom words that it can handle.
but it might both reduce the insane load times we currently see even with English dictionaries
@nathanlesage with the recent performance optimizations I submitted I'm getting the English dictionary loaded on my ~6-year-old laptop in ~150ms or so; I think that's pretty good already, and if you are running nspell in a worker it doesn't really matter much whether it takes 0ms or 150ms to load a dictionary.
Any thoughts? What obvious thing did I overlook?
I'm not very familiar with the details of generating a dictionary currently; I think something like that could potentially be doable, but finding which words to lazily evaluate without parsing the entire thing sounds difficult to me, and it may backfire performance-wise: performing that computation may pretty quickly take longer than just parsing the entire thing.
IMO further performance optimization routes might be:
- Using character scanning rather than regexes for rules, this may reduce init time by ~15% or so, or not.
- Using a bloom filter or similar structure rather than asking the Map if a word is already present in it, as for example ~93% of the time for the French dictionary we are dealing with new words; maybe this could reduce init time by another ~20% or so, but I've never used bloom filters before so maybe I'm talking nonsense (a rough sketch follows after this list).
- Pre-parsing dictionaries, this could potentially remove most of the time required for initialization, but the problem is that things like the French dictionary become a 70MB file from a ~2MB raw dictionary file, so shipping that with the app or downloading it is not really feasible.
- Parallelizing initialization so that for example 4 worker threads might work on their own chunk of the dictionary, this could be doable and would reduce init time by an amount proportional to the number of workers and actual cores available, but chunking the dictionary is not trivial because some words can be added as prefix/suffix to other words for example.
- Ordering dictionaries by word usage, so that nspell could first load the first 10,000 words and, if misspelled words requiring some suggestions are found, the next 10k words could be loaded, and then the next ones etc. until some suggestions are found. As a negative side effect this could reduce the number of suggestions provided, but maybe that would be a good enough trade-off for some languages. But then again, partially parsing dictionaries like that is not trivial.
- Maybe memory usage could be reduced significantly still by not joining words together, like if instead of storing "buongiorno" nspell stored "buon" and "giorno" in an array or something I think v8 could allocate those strings only once in memory for the entire dictionary, saving a good deal of memory for languages with lots of computed words, but then everything internally that currently works on strings would need to be rewritten to work on these kind of disjoined strings, which wouldn't be easy probably.
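Regarding the bloom filter point above, this is roughly the kind of structure I have in mind, just a sketch with arbitrary sizing and hashes, nothing that exists in nspell today:

```ts
// Minimal Bloom filter sketch. It answers "definitely new" vs "maybe seen",
// so the Map only has to be consulted for the "maybe seen" minority.
class BloomFilter {
  private readonly bits: Uint8Array;
  private readonly size: number;

  constructor(expectedItems: number, bitsPerItem = 10) {
    this.size = expectedItems * bitsPerItem;
    this.bits = new Uint8Array(Math.ceil(this.size / 8));
  }

  // Two simple string hashes; the k probe positions are derived as h1 + i*h2.
  private hashes(word: string): [number, number] {
    let h1 = 0x811c9dc5;
    let h2 = 5381;
    for (let i = 0; i < word.length; i++) {
      const code = word.charCodeAt(i);
      h1 = Math.imul(h1 ^ code, 0x01000193) >>> 0;
      h2 = (Math.imul(h2, 33) + code) >>> 0;
    }
    return [h1, h2];
  }

  add(word: string): void {
    const [h1, h2] = this.hashes(word);
    for (let i = 0; i < 7; i++) {
      const bit = (h1 + i * h2) % this.size;
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  mightContain(word: string): boolean {
    const [h1, h2] = this.hashes(word);
    for (let i = 0; i < 7; i++) {
      const bit = (h1 + i * h2) % this.size;
      if ((this.bits[bit >> 3] & (1 << (bit & 7))) === 0) return false;
    }
    return true;
  }
}

// Load-time idea: if the filter says a word was definitely never added, skip
// the Map lookup and insert directly; only fall back to Map.has() on a "maybe".
```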
Folks, I think performance is an XY problem. There are two variables here: a) dictionaries, b) hunspell vs. nspell. Most dictionaries work fine with nspell. Some don't. As mentioned before, Hunspell does okay on Italian. There is a bug in the code here which causes lots of words to be added that I don't believe need to be added (but I don't speak Italian and hence can't help here). The solution is to look at what is going on in the code that shouldn't be going on. Perhaps it's depending on a feature that isn't supported here?
@wooorm are you certain there's a bug generating unnecessary words? 16M+ words sounds like a lot to me, but given how lots of words are spelled differently in Italian depending on their gender or pluralization, I can see how a much greater number of words could be generated for Italian compared to English. I'll skim the generated list of words; maybe I can spot some seemingly problematic ones.
I thought that that was the problem, because (excuse me if I'm off) roughly related languages like Spanish or so don't have this problem.
Dictionaries are written by authors who sometimes aren't that fluent in programming. Some include bugs. I'm under the impression that this one isn't written well 🤷‍♂️ and that maybe hunspell is capping certain ginormous expansions whereas nspell isn't.
I think I have some new interesting findings to share:
- Generating way fewer words:
- I've taken a closer look at the Italian dictionary, and it generates about 30 million words, significantly exceeding the ~16M limit.
- The number of words generated by the Italian dictionary explodes because Italian has these silly prefixes that can be attached to words. For example, in English one might say "that tree"; in Italian "that" can be translated as "quello"/"quella" when referring to a noun in the singular form, but it's common to truncate the trailing vowel and attach the remaining characters to the noun, so in Italian one would translate that English bit as "quell'albero" rather than "quello albero". Another major issue is that words in Italian, but I guess in all languages with a Latin alphabet really, can start with an uppercase character if they are at the start of a sentence; that helps make the list of words generated by nspell explode, because now in the list you'll have:
albero, Albero, quell'albero, quell'Albero, Quell'albero and Quell'Albero.
- Potential fix number 1: I think it would be totally reasonable to consider those prefixes as their own words, which grammatically they technically are, and instead just have "quell'", "Quell'", "albero" and "Albero" in the dictionary.
- Potential fix number 2: Duplicating the number of words just to account for the case where "albero" is at the start of a sentence sounds super wasteful to me; there shouldn't be "albero" and "Albero" in the dictionary really, but just "albero". I think we can probably account for this dynamically when computing suggestions and checking for correctness (see the sketch after this list).
- Potential fix number 3: "Quell'Albero" is not actually correct spelling because "albero" is not a name like "Fabio" or "Italy", unless somebody explicitly gives something the "Albero" name for some reason, so it shouldn't be present in the final list of words at all.
- I think all these potential fixes together could cut down on the number of words generated, and by consequence CPU and RAM required, by up to ~80%.
- French also has a similar issue and that's why its dictionary contains millions of words too.
- Mind-blowing unbelievably amazing truly outstanding performance:
- Computing the list of words dynamically is where most of the CPU time is spent when loading a dictionary, and most of the RAM is spent just storing this huge list of words, which contains largely redundant information (i.e. it compresses to a high degree). Apparently it's possible to cut down on CPU time almost entirely by pre-parsing dictionaries, and to cut down on RAM usage massively by storing the list of words in a fancy ~compressed data structure that doesn't need to be decompressed before being used; more details on how the data structure works here.
- With the optimization mentioned above I can get the English dictionary to load in ~2ms on my machine, rather than the current ~150ms, and, unbelievably, the gzipped precomputed list of words weighs less than the gzipped raw dictionary used to generate it in the first place, so it would take less time to download it too.
- With the same optimization I can get the French dictionary to load in ~20ms on my machine, rather than ~4s (milliseconds, not seconds), and the ~compressed list of words weighs just 1.2MB (~800kb when gzipped), which doesn't require nearly as much RAM to keep in memory (currently we are at ~240MB for the French dictionary); and with the potential reduction in the number of words generated these figures would go down too.
- Also if for some reason a language legitimately requires 50 million words to be stored I think as a side effect of how the data structure required to generate this ~compressed list of words works we probably won't be hitting the hard-coded 16M keys limit anyway.
- What's the catch?
- Generating the ~compressed list of words is expensive; it took half an hour for the French dictionary on my machine, but it only has to be done once.
- If the way lists of words are generated changes, i.e. if nspell gets better, the ~compressed lists of words should be calculated again.
- The ~compressed list of words can't be mutated dynamically, meaning adding and removing words from the dictionary must be implemented on top of that, but it should be doable I think.
- There aren't many libraries that I could find that could generate the "succinct", as they call them, data structure that we need, none of them widely used, so the available ones might be buggy and difficult to fix.
- It might be difficult to extend these libraries to work with arbitrary alphabets (i.e. they might not currently support storing accented characters for example, or Vietnamese characters or Cyrillic characters etc.).
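Here's the sketch of potential fix number 2 mentioned above; the helper is hypothetical and not the actual nspell API, it just shows the check-time case folding:

```ts
// Store only the dictionary-cased form and fold a leading capital at check
// time instead of generating a second entry for every word.
function isCorrect(word: string, dictionary: Set<string>): boolean {
  if (dictionary.has(word)) return true;

  // "Albero" at the start of a sentence should match the stored "albero",
  // but an all-caps word like "HTML" should not be folded this way.
  const first = word.charAt(0);
  const rest = word.slice(1);
  if (first !== first.toLowerCase() && rest === rest.toLowerCase()) {
    return dictionary.has(first.toLowerCase() + rest);
  }

  return false;
}

// isCorrect('Albero', new Set(['albero'])) -> true
// isCorrect('ALBERO', new Set(['albero'])) -> false (not plain sentence case)
```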
I think first of all we should get those optimizations I mentioned in, to cut down massively on the number of words generated; it would be great if @wooorm could take care of that, as I'm not nearly as familiar with the library. That would bring down the time required to load even problematic dictionaries like the French or Italian ones to under 1s, which we can work with. Secondly, I, or some other volunteer, will spend some time attempting to achieve that truly "final" mind-blowing optimization. Thirdly, we will all share on Twitter our happiness about the outstanding performance achieved, especially in the face of the Rust-all-the-things freaks.
Thoughts?
This sounds amazing! Thank you for digging into it!
If the complete (!) list of computed words (without affixes, if I understand you correctly? Just the words we can search for?) is THAT useable, we could ditch hunspell completely, and just write a small program that generates these compressed lists from hunspell dictionaries. We can leverage Github Actions for that.
Then we would just need mechanisms to load them and do searches in these dictionaries. Then users could download our generated dictionaries and need to load them using nspell.
With regard to the generating program -- if you tell me how you did it, I could volunteer to write such a generator in Rust, so a faster system programming language. We could then run it via push commits, to keep the lists up to date.
This would split up the work and we would be able to implement this faster? @wooorm could then provide guidance with regard to the algorithms.
What do you think?
without affixes, if I understand you correctly?
I think affixes are used during word generation, so they should already be taken into account when serializing parsed dictionaries. The affixes themselves are not included in the serialized dictionaries, but they weigh next to nothing compared to the actual words generated (just 24kb zipped for the French one, I think the English one is 1kb or something), so they can just be downloaded too with essentially no significant performance implications.
If the complete (!) list of computed words is THAT useable
It's truly amazing; the magic data structure also allows for efficiently fetching all possible ways to complete a word, essentially fetching known possible suffixes. It might allow for fetching known prefixes too, I'm not sure.
we could ditch hunspell completely, and just write a small program that generates these compressed lists from hunspell dictionaries. We can leverage Github Actions for that.
We need something to generate the final list of words (nspell), something to pack that into the magic data structure (wizardry), something that can work directly with that packed data structure (dark wizardry), and finally something implemented on top of that that allows for adding and removing words (a future version of nspell maybe, I guess it depends on which path @wooorm wants the project to take).
Then we would just need mechanisms to load them and do searches in these dictionaries.
That's all included in the magic library I tried earlier already.
With regard to the generating program -- if you tell me how you did it, I could volunteer to write such a generator in Rust, so a faster system programming language. We could then run it via push commits, to keep the lists up to date.
All the magic comes from this library: https://github.com/mckoss/dawg, computer science boys.
I wouldn't bother rewriting that in Rust honestly; it takes seconds to generate a packed dictionary for English, and we can probably get it to generate packed dictionaries for any language in under a minute with some optimizations both to it and to nspell. And with a rewrite subtle bugs may be introduced.
If you want to volunteer some time you should probably look into how to make that library work with arbitrary unicode words rather than just ASCII ones. Other than that there doesn't seem to be really much more work necessary I think.
This would split up the work and we would be able to implement this faster? @wooorm could then provide guidance with regard to the algorithms.
I don't currently have too much time to dedicate to this, but I absolutely want to get this done eventually because the resulting performance is incredible. So I don't know, maybe ideally @wooorm could update nspell to generate fewer words, @nathanlesage could patch the succinct DAWG implementation for our needs, and I could write the wrapper library that ties everything together (meaning patching nspell or writing something else)?
I would probably also want to look into using hunspell directly for words generation as well.
Also some words have some "codes" associated with them in nspell, I'm not too sure what's the deal with them, we probably shouldn't just throw them away, and we can't store them in the DAWG itself either. But that's a secondary problem we can figure out later.
Aaaaah a tree (graph) — I've dealt with these during my last project, and as I will probably be using them in my PhD, I could count porting that thing to a systems language and splitting up the work into two libraries as work time? But we'd probably use the TS DAWG library for nspell to use them as well.
In any case, while the math behind that looks cool, it's actually not THAT difficult to implement (we could draw on a lot of arXiv papers, as the algorithm is already there). This might give us the freedom to also add non-ASCII support (ergo increasing the number of graph starting points from 26 to, idk, 128 or so).
We just need to know how the data structure of the library looks so that we can adapt the output. (Btw the reason for me insisting on a non-JS program is simply that it's easier to have a command line program running on a GitHub worker; it could lift work for us to do in the future. And I don't like utilizing JavaScript for computation-heavy tasks due to its runtime overhead, but maybe I'm just the stereotypical German in this respect.)
What do you think of the following:
Workload command line application:
Hunspell dic/aff
--> expand internally, resolving all words
--> map capitalized and non-capitalized words
and other optimizations you suggested
--> build the tree
--> save as tarball/zip (?)
Nspell:
Read the tarball
--> read in the tree
--> use for spelling correction immediately
@nathanlesage It's not just the graph, it's the combination of the graph and its compact string representation (which doesn't really need to be decompressed) that gives the amazing performance numbers I reported. If you didn't bother with the succinct string representation and instead required decompressing the dictionary and rebuilding the graph, I don't think you would see nearly as much performance improvement; there's a good chance it may even backfire: ~2s of the ~3.3s that nspell currently takes to parse the French dictionary is just adding words to a JS Map, and I really don't think you can do that 100x faster, you need to work with the ~compressed graph directly.
I think going with TS would be a better choice, mainly because we have a reference implementation already that looks well written, well tested and well commented; we need a JS version at least for working with the string anyway because we need this to work in the browser, and a CLI wrapper for it can be made trivially.
But feel free to reimplement the packing part of it in Rust or whatever language you like, as long as we can get this thing to work I don't care if portions of the succinct DAWG library are written in Rust.
save as tarball/zip (?)
Read the tarball
The plan looks good, but we can't afford zipping the graph or any other form of string serialization that requires decompressing before use; the performance gains come from using the string representation directly.
Also an implementation of the DAWG that we can actually use in the browser would allow for storing custom words to add and remove from the dictionary in a memory efficient way potentially, so I don't really see any inherent advantages of using Rust other than potentially saving a few minutes once when generating these graphs.
Ah so you were just talking about the string representation itself - apologies. I'll have a look at that library myself and report back!
Potentially a Rust implementation could be useful in the future to traverse the graph in parallel using SIMD instructions in WebAssembly, but like SIMD instructions aren't here yet, it would probably be difficult to implement that on top of the string representation, and checks for presence in the graph seem instant to me anyway so any optimization in that regard may be pointless.
hunspell's unmunch command can dump all known words from a dictionary, but I think it doesn't work with non-ASCII languages.
Interestingly it seems to generate roughly the same number of words as nspell, so nspell may reimplement enough of it to be able to generate the words list we need.
hunspell's interpretation of the Italian dictionary also has the same issue I mentioned with the prefixes: it generates 36 million words, which together make a ~650MB file, which zipped goes down to ~90MB. It'd be interesting to see how big the gzipped version of the packed DAWG for this words list is, but that would take hours if not days to compute on my machine; most probably it's going to be unreasonably large, so those problematic prefixes need to be handled specially.
It just finished generating the succinct DAWG for the first 6 million words in the Italian dictionary: it ~compressed 100MB into 60kb (a 99.94% compression ratio), which screamed "that's buggy for sure" to me. But then again, opening the text file containing those 6 million words, everything looks kind of the same to me, it's a mess of different prefixes that blow up the combinatorics; plus the DAWG library is currently case insensitive, so it probably deduplicated something like 75% of those words anyway.
That's to say that the DAWG library really looks like a piece of magic, and I guess perhaps having lists of words in the hundreds of megabytes might not be that big a deal actually.
Hi everyone, I'm following this issue with great interest because I would like to use the spellchecking functionality in pure javascript.
I'm the author of bibisco, a desktop app made in Electron.
Currently I am implementing the spellchecking features using electron-spellchecker, which however uses native modules, which are deprecated as of Electron version 9.
I can't use the spellchecker shipped with Electron, because it doesn't allow you to specify the language to use on Mac.
I find @fabiospampinato's idea of having an exploded vocabulary of terms optimized using a DAWG very interesting. The same idea is also implemented by the simple-spellchecker library, without the DAWG optimization.
However, this solution presents the problem of suggestions: how would they be implemented? In the simple-spellchecker library, a binary search with the Damerau-Levenshtein distance is used.
Some other notes:
- I'm Italian, so I am available in case of linguistic doubts.
- Hunspell's unmunch command also works with UTF-8, but the command execution environment must support UTF-8. I was able to generate the Russian and Czech vocabularies.
- Hunspell's unmunch command has performance problems with some vocabularies; I was unable to generate the Turkish vocabulary. Any suggestions?
However, this solution presents the problem of suggestions: how would they be implemented? In the simple-spellchecker library, a binary search with the Damerau-Levenshtein distance is used.
You could convert the succinct DAWG into whatever tree structure simple-spellchecker is using, so this issue doesn't really exist as you can map one problem onto the other and at least one of the two is solved in your mind.
In practice though it wouldn't make sense not to use the succinct DAWG directly, as that's where the magic is. In the context of the DAWG, searching for suffixes means walking down the graph listing all paths to all terminal nodes (fast); searching for prefixes means walking up the graph and listing all paths leading to the starting node (it could be made fast too but requires some preprocessing to kind of invert the graph); searching for spelling suggestions in general could mean exploring the space of strings at reasonable edit distances from the query string and checking for their presence in the graph (i.e. walking down the graph in search of a terminal node associated with the edited string in question, which is fast enough but probably slower than binary search); but we could also look for suggestions potentially more efficiently by walking the graph directly, as we already know which paths lead to valid words (I think this could potentially be faster than naively computing edit distances and binary searching the whole list).
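To make that last idea a bit more concrete, here's a rough sketch that walks a plain trie while carrying one row of the Levenshtein DP table, pruning branches that can't stay within the allowed distance. The node shape is an assumption; the succinct DAWG's actual API would differ:

```ts
interface TrieNode {
  children: Map<string, TrieNode>;
  terminal: boolean; // true if the path from the root spells a dictionary word
}

function suggest(root: TrieNode, query: string, maxDistance: number): string[] {
  const results: string[] = [];
  const firstRow = Array.from({ length: query.length + 1 }, (_, i) => i);

  const walk = (node: TrieNode, prefix: string, prevRow: number[]): void => {
    for (const [char, child] of node.children) {
      // Compute the next Levenshtein DP row for this character.
      const row = [prevRow[0] + 1];
      for (let i = 1; i <= query.length; i++) {
        const insertCost = row[i - 1] + 1;
        const deleteCost = prevRow[i] + 1;
        const replaceCost = prevRow[i - 1] + (query[i - 1] === char ? 0 : 1);
        row.push(Math.min(insertCost, deleteCost, replaceCost));
      }
      if (child.terminal && row[query.length] <= maxDistance) {
        results.push(prefix + char);
      }
      // Prune: if no cell is within range, no word below this node can match.
      if (Math.min(...row) <= maxDistance) {
        walk(child, prefix + char, row);
      }
    }
  };

  walk(root, '', firstRow);
  return results;
}
```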
Hunspell's unmunch command also works with UTF-8, but the command execution environment must support UTF-8. I was able to generate the Russian and Czech vocabularies.
There's an issue on hunspell's repo about implementing another command for dumping all words; I don't think unmunch per se is slow, it extracted half a gigabyte of words from the Italian dictionary in seconds on my machine.
Hunspell's unmunch command has performance problems with some vocabularies; I was unable to generate the Turkish vocabulary. Any suggestions?
You can attempt to extract the words via nspell or some other somewhat compatible library.
searching for prefixes means walking up the graph and listing all paths leading to the starting node (it could be made fast too but requires some preprocessing to kind of invert the graph)
Actually, not a problem: search algorithms on acyclic graphs work fast in both directions. I read a few things last night and I think it's really fairly simple to generate and traverse such trees.
But we're still lacking a fast file format for these trees, as I still think it's wasted CPU time to recompile the tree every time we load the app (plus the additional disk IO) -- even transistors have a limited lifetime and I'd like to efficientiate the hell out of this very, very good idea.
As for the mapping: pretty easy. As far as I know, distance algorithms can give you some alternatives, and we can then traverse the tree yielding the necessary words using topological search.
An additional idea, which does not have any priority, but nevertheless: we could add a simple form of machine learning. As one instance will only be run on one computer by one user, we could save wrongly typed words and their replacements in a hashmap, which can be offloaded and loaded using a simple function call; this could then speed up spell checking even more if we, e.g., limit suggestions to 10 max or something like that.
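Something as small as this sketch could already cover that (hypothetical, not part of nspell): a per-user map from typo to chosen replacement that can be serialized and loaded again:

```ts
// Remember which replacement the user picked for a typo and serve it first
// next time, before running the expensive suggestion search at all.
class CorrectionCache {
  private corrections = new Map<string, string>();

  remember(typo: string, replacement: string): void {
    this.corrections.set(typo, replacement);
  }

  // Returns a previously chosen replacement, if any.
  lookup(typo: string): string | undefined {
    return this.corrections.get(typo);
  }

  // "Offload" to plain JSON so it can be persisted between sessions...
  save(): string {
    return JSON.stringify([...this.corrections]);
  }

  // ...and loaded back with a single call.
  static load(json: string): CorrectionCache {
    const cache = new CorrectionCache();
    for (const [typo, replacement] of JSON.parse(json) as [string, string][]) {
      cache.remember(typo, replacement);
    }
    return cache;
  }
}
```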