AhoCorasickDoubleArrayTrie

Building the trie always fails with "OutOfMemoryError: GC overhead limit exceeded" once the number of entries exceeds 1000000

Open gaohang opened this issue 4 years ago • 2 comments

Is there a limit on dictionary capacity? The machine has 64 GB of RAM, so memory should be sufficient.

gaohang avatar Jul 07 '20 07:07 gaohang

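This structure uses UTF-16 as its code table, so it is not well suited to storing large dictionaries. Chinese characters occupy the Unicode range 0x4E00--0x9FA5, which is fairly spread out. You could try using bytes as the code table instead.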
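One possible reading of "bytes as the code table", as a rough sketch only: re-encode each key so that every `char` carries a single UTF-8 byte (0x00 to 0xFF) before handing it to the trie. The class and method names below (`ByteKeys`, `toByteKey`, `fromByteKey`) are hypothetical helpers, not part of this library, and whether this actually avoids the OOM for a given dictionary is untested here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of AhoCorasickDoubleArrayTrie.
class ByteKeys {
    // Encode the string as UTF-8, then reinterpret each byte as one char.
    // ISO-8859-1 maps bytes 0x00..0xFF to chars U+0000..U+00FF one-to-one,
    // so every char in the result is below 256.
    static String toByteKey(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Inverse of toByteKey: recover the original string from a byte key.
    static String fromByteKey(String key) {
        return new String(key.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "字典";
        String key = toByteKey(word);
        // Each of these two characters takes 3 bytes in UTF-8, so the key has 6 chars.
        System.out.println(key.length());
        System.out.println(fromByteKey(key).equals(word));
    }
}
```

Under this scheme both the dictionary keys and the text to be scanned would need the same encoding, and any match offsets would be byte offsets into the encoded text, not character offsets into the original. Keys also become longer (three chars per typical Chinese character), which trades trie depth for a much denser code range.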

hankcs avatar Jul 07 '20 16:07 hankcs

Compared with a hashmap, a DAT consumes less memory. However, a hashmap of 100000000 docs can be built in memory, while a DAT with 10000000 docs leads to OOM?

gaohang avatar Jul 18 '20 06:07 gaohang