cspell-dicts

Support both dictionaries: dictionaries/en_US/en_US.trie.gz and additional_words.txt

Open vikivivi opened this issue 3 years ago • 2 comments

Every time a new word is added to dictionaries/en_US/src/additional_words.txt, dictionaries/en_US/en_US.trie.gz gets rebuilt, and the Git metadata and Git object database are inflated by roughly 400 KB because the rebuilt file is binary.

 dictionaries/en_US/CHANGELOG.md             |   12 ++
 dictionaries/en_US/checksum.txt             |    4 +-
 dictionaries/en_US/en_US.trie.gz            |  Bin 401990 -> 401909 bytes
 dictionaries/en_US/package.json             |    2 +-
 dictionaries/en_US/src/additional_words.txt |    2 +

Do you think cspell could support handling both en_US.trie.gz and additional_words.txt for en_US? That way, the Git object database would not inflate so quickly. The same concept could be applied to all other languages that use *.trie.gz.

I understand that the size of the Git object database is not an issue with the cspell dictionaries themselves, but keeping Git objects efficient has its own benefits.

./dictionaries/ar/src/additional_words.txt
./dictionaries/de_CH/src/additional_words.txt
./dictionaries/de_DE/src/additional_words.txt
./dictionaries/en_GB-MIT/src/additional_words.txt
./dictionaries/en_GB/src/additional_words.txt
./dictionaries/en_US/src/additional_words.txt
./dictionaries/es_ES/src/additional_words.txt
./dictionaries/fr_FR_90/src/additional_words.txt
./dictionaries/fr_FR/src/additional_words.txt
./dictionaries/nl_NL/src/additional_words.txt
./dictionaries/pt_BR/src/additional_words.txt
./dictionaries/python/src/additional_words.txt
./dictionaries/ru_RU/src/additional_words.txt
./dictionaries/sl_SI/src/additional_words.txt
./dictionaries/sv/src/additional_words.txt
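
The inflation described in this post can be reproduced with a small sketch. It uses a pseudo-random sorted word list as a stand-in for a dictionary (the real .trie format is not modeled, and `make_word_list`, `gzip_bytes`, and `common_prefix_len` are hypothetical helper names). Adding a single word near the top of the list typically makes the gzip output diverge within the first bytes, which leaves Git little to reuse between the two versions of the binary.

```python
import gzip
import random


def make_word_list(count, seed=0):
    """Build a sorted list of pseudo-random lowercase words (stand-in data)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    words = set()
    while len(words) < count:
        length = rng.randint(3, 10)
        words.add("".join(rng.choice(letters) for _ in range(length)))
    return sorted(words)


def gzip_bytes(words):
    """Compress the word list deterministically (mtime=0 keeps the header stable)."""
    text = "\n".join(words) + "\n"
    return gzip.compress(text.encode("utf-8"), mtime=0)


def common_prefix_len(a, b):
    """Count how many leading bytes two byte strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


old_words = make_word_list(5000)
new_words = sorted(old_words + ["aaanewword"])  # one new entry near the top

old_gz = gzip_bytes(old_words)
new_gz = gzip_bytes(new_words)
shared = common_prefix_len(old_gz, new_gz)

print(f"compressed size: {len(old_gz)} -> {len(new_gz)} bytes")
print(f"identical leading bytes: {shared}")
```

The exact numbers depend on the data and the zlib version, so none are quoted here. The point is only that the shared prefix is a small fraction of the file.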

vikivivi avatar Aug 11 '22 14:08 vikivivi

@vikivivi,

You make a good point.

As far as cspell is concerned, the size of the dictionary doesn't matter, but the number of dictionaries does. So, I'm reluctant to "add" more dictionaries.

Back to the original problem: binary objects taking up a lot of space. Since trie files are text files, it might be worth it to just keep the .trie file instead of the .trie.gz and to compress the .gz during publication to npm. Using trie files will keep the stored Git objects smaller, even though the files themselves are bigger, because it is possible to "diff" the files. .trie files are stored because they take a long time to build. .trie.gz files had been stored because they are smaller, but as you point out, since they are binary, they take up more space in the long run.
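
As a rough illustration of this approach, the sketch below keeps a plain text file as the tracked source, shows that adding one word produces a one-line diff, and gzips the file only as a publish-time artifact. The file names, the tiny word list, and the `compress_for_publish` helper are assumptions made for the example; this is not the actual cspell-dicts build script, and the real .trie format is not modeled.

```python
import difflib
import gzip
import shutil
import tempfile
from pathlib import Path


def changed_lines(old_lines, new_lines):
    """Return only the added/removed lines of a unified diff (no headers)."""
    diff = difflib.unified_diff(old_lines, new_lines, lineterm="", n=0)
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def compress_for_publish(src, dest):
    """Hypothetical publish step: gzip the tracked text file into an artifact."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# The tracked text file changes by a single line when one word is added.
old_lines = ["apple", "banana", "cherry"]
new_lines = ["apple", "apricot", "banana", "cherry"]
delta = changed_lines(old_lines, new_lines)
print(delta)

# The .gz is produced on demand and never needs to be committed.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "words.txt"
    dest = Path(tmp) / "words.txt.gz"
    src.write_text("\n".join(new_lines) + "\n", encoding="utf-8")
    compress_for_publish(src, dest)
    with gzip.open(dest, "rt", encoding="utf-8") as f:
        round_trip = f.read().splitlines()
```

The trade-off is the one named above: the text file is larger on disk than its gzip, but successive versions of it differ by small, diffable changes.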

Jason3S avatar Aug 15 '22 10:08 Jason3S

> Since trie files are text files, it might be worth it to just keep the .trie file instead of the .trie.gz and to compress the .gz during publication to npm.
> .... because it is possible to "diff" the files.

I support this.

vikivivi avatar Aug 15 '22 10:08 vikivivi

I'm going to close this, since moving to text files addresses the issue.

Jason3S avatar Sep 01 '22 06:09 Jason3S