elasticsearch-analysis-vietnamese

The plugin with new C++ tokenizer

Open duydo opened this issue 3 years ago • 7 comments

Starting with version 7.12.11, the plugin uses the Cốc Cốc C++ tokenizer instead of VnTokenizer. I am therefore closing all issues related to VnTokenizer, and I won't maintain the plugin with VnTokenizer anymore.

The Cốc Cốc tokenizer is used in Cốc Cốc Search and Ads systems and the main goal in its development was to reach high performance while keeping the quality reasonable for search ranking needs.

If you want to use the plugin with earlier versions of Elasticsearch, you can build it yourself by following the guide in the README file.

duydo avatar Apr 22 '21 04:04 duydo

@duydo The packaged zip of the plugin does not contain the tokenizer. What is the process for installing the new tokenizer on an Elasticsearch node?

soosinha avatar Apr 22 '21 14:04 soosinha

@soosinha The tokenizer is written in C++, so it has to be built as a shared library on the Elasticsearch node. You can refer to the installation guide in the README file, section "Step 1: Build C++ tokenizer for Vietnamese library".

duydo avatar Apr 22 '21 15:04 duydo
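As a rough aid for checking that the shared library built in that step is visible to the JVM, here is a minimal Java sketch. It assumes the library is resolved through `java.library.path` and that its base name is `coccoc_tokenizer_jni` (inferred from the path `/usr/lib/libcoccoc_tokenizer_jni.so` that appears in crash logs later in this thread). The class `NativeLibraryCheck` and its `locate` method are hypothetical names and not part of the plugin.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class NativeLibraryCheck {
    // Assumed base name, inferred from libcoccoc_tokenizer_jni.so in the logs of this thread.
    static final String BASE_NAME = "coccoc_tokenizer_jni";

    /** Returns the first file on searchPath that matches the platform-specific library name. */
    static Optional<Path> locate(String baseName, String searchPath) {
        // mapLibraryName turns "name" into the platform file name, e.g. "libname.so" on Linux.
        String fileName = System.mapLibraryName(baseName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Path.of(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        String message = locate(BASE_NAME, searchPath)
                .map(p -> "found: " + p)
                .orElse("not found on java.library.path: " + System.mapLibraryName(BASE_NAME));
        System.out.println(message);
    }
}
```

Running `main` with the same JVM options as the Elasticsearch node only reports whether a matching file exists on the search path; it does not load the library or verify that it was built correctly.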

@duydo I built and ran your plugin with the Cốc Cốc C++ tokenizer, and it crashes from time to time.
Message in log: double free or corruption (fasttop)
ES version: >= 7.12.1
Could you help check when you have time, please? I think there is a problem with the new tokenizer or with the binding from C++ to Java. Thank you.

Update: I checked and found that this problem occurs only when an index is created with more than one shard. ES uses one thread for each shard, and I think this C++ library is not thread-safe, which causes the crash.

kennynguyeenx avatar Mar 22 '22 04:03 kennynguyeenx
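If the hypothesis above is right (a native library that is not thread-safe being called from one thread per shard), one common mitigation is to serialize every call into it behind a lock. The following Java sketch illustrates only that general technique. `Segmenter`, `FakeNativeSegmenter`, and `SerializedSegmenter` are hypothetical names, the fake is a pure-Java stand-in rather than the real JNI binding, and this is not necessarily how the plugin addresses the problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical interface standing in for a native tokenizer binding. */
interface Segmenter {
    List<String> segment(String text);
}

/** Pure-Java fake with a shared scratch buffer, so concurrent unsynchronized use would be unsafe. */
final class FakeNativeSegmenter implements Segmenter {
    private final List<String> scratch = new ArrayList<>(); // shared mutable state

    @Override
    public List<String> segment(String text) {
        scratch.clear();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                scratch.add(part);
            }
        }
        return new ArrayList<>(scratch); // copy out before the buffer is reused
    }
}

/** Wraps any Segmenter so that only one thread at a time can enter it. */
final class SerializedSegmenter implements Segmenter {
    private final Segmenter delegate;
    private final ReentrantLock lock = new ReentrantLock();

    SerializedSegmenter(Segmenter delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> segment(String text) {
        lock.lock();
        try {
            return delegate.segment(text);
        } finally {
            lock.unlock(); // always release, even if the delegate throws
        }
    }
}
```

A single lock trades throughput for safety, since shards would then tokenize one at a time. Giving each thread its own tokenizer instance is another option, but only if the native library keeps no global state shared between instances.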

Try with: https://github.com/miczone/coccoc-tokenizer

linuxpham avatar May 03 '22 05:05 linuxpham

@kennynguyeenx

> Update: I checked and found that this problem occurs only when an index is created with more than one shard. ES uses one thread for each shard, and I think this C++ library is not thread-safe, which causes the crash.

This issue has been fixed in this branch: https://github.com/duydo/elasticsearch-analysis-vietnamese/tree/feature/search-issues

duydo avatar May 04 '22 03:05 duydo

@duydo Thank you very much

kennynguyeenx avatar May 04 '22 15:05 kennynguyeenx

Hi @duydo, we're using the elasticsearch-analysis-vietnamese plugin and we keep getting this error:

12]: *** Error in `/usr/share/elasticsearch/jdk/bin/java': double free or corruption (!prev): 0x00007f49dc04bf70 ***
12]: ======= Backtrace: =========
12]: /lib64/libc.so.6(+0x81329)[0x7f4a8754a329]
12]: /usr/lib/libcoccoc_tokenizer_jni.so(_ZN3spp11sparsetableISt4pairIKifENS_14libc_allocatorIS3_EEE12_free_groupsEv+0x2d)[0x7f49c92e7fed]

ES version is 7.16.2

Thanks!

Update: We had another crash, and here is what journalctl -u elasticsearch.service shows:

*** Error in `/usr/share/elasticsearch/jdk/bin/java': double free or corruption (out): 0x00007fe0e407b340 ***

A fatal error has been detected by the Java Runtime Environment:

SIGBUS (0x7) at pc=0x00007fe11e1064bc, pid=5398, tid=5772

JRE version: OpenJDK Runtime Environment Temurin-17.0.1+12 (17.0.1+12) (build 17.0.1+12)
Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (17.0.1+12, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64
Problematic frame: C[thread 5592 also had an error] [thread 5775 also had an error]
======= Backtrace: =========
[libc.so.6+0x804bc]/lib64/libc.so.6(+0x81329)[0x7fe11e107329]
/usr/lib/libcoccoc_tokenizer_jni.so(_ZN3spp11sparsetableISt4pairIKifENS_14libc_allocatorIS3_EEE12_free_groupsEv+0x2d)[0x7fe05c8abfed]
/usr/lib/libcoccoc_tokenizer_jni.so(_ZN9Tokenizer24unserialize_nontone_dataERKSs+0x11d)[0x7fe05c8b149d]
/usr/lib/libcoccoc_tokenizer_jni.so(Java_com_coccoc_Tokenizer_initialize+0x1ee)[0x7fe05c8a746e]
[0x7fe10129053a]
======= Memory map: ========
580000000-7ff700000 rw-p 00000000 00:00 0

ManWithASideQuest avatar Jul 15 '22 14:07 ManWithASideQuest
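The backtrace above passes through Java_com_coccoc_Tokenizer_initialize and reports errors on several threads, which would be consistent with the native initializer running concurrently on more than one thread. That reading is an inference from the log, not a confirmed diagnosis. Under that assumption, the Java sketch below shows one generic way to make an initializer run exactly once; `OnceInitializer` is a hypothetical helper and not part of the plugin or the tokenizer.

```java
/** Runs a given initialization action exactly once, even when called from many threads. */
final class OnceInitializer {
    private final Runnable init;
    private boolean done = false; // guarded by the monitor of this object

    OnceInitializer(Runnable init) {
        this.init = init;
    }

    /** Blocks concurrent callers until the first caller has finished the initialization. */
    synchronized void ensureInitialized() {
        if (!done) {
            init.run();
            done = true; // set only after init succeeds, so a failed attempt can be retried
        }
    }
}
```

In a real binding, the `Runnable` would wrap the single call that loads the tokenizer's dictionaries, and every code path that needs the tokenizer would call `ensureInitialized()` first.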