dovecot-fts-flatcurve
Segfault with Dovecot 1:2.3.21+dfsg1-2 from Debian
Hi,
For a few days now I haven't been able to use flatcurve because of the errors below:
dovecot[3384677]: indexer-worker(user)<3384699><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB:wMLDAQeM62V7pTMAquyQZQ>: Warning: fts-flatcurve(INBOX): Could not write message data: uid=51735; InvalidArgumentError: Term too long (> 245): _znst10_hashtablei7qstringst4pairiks0_iesais3_enst8__detail10_select1stest8equal_tois0_est4hashis0_ens5_18_mod_range_hashingens5_20_default_ranged_hashens5_20_prime_rehash_policyens5_17_hashtable_traitsilb0elb0elb1eeee9_m_rehashe{size_t}rk{size_t}@ba
dovecot[3384677]: indexer-worker(user)<3384699><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB:wMLDAQeM62V7pTMAquyQZQ>: Warning: fts-flatcurve(INBOX): Could not write message data: uid=51737; InvalidArgumentError: Term too long (> 245): 9qtprivate18qfunctorslotobjectist5_bindifmn5qcoro6detail17waitoperationbasei10qtcpservereefvnst7__n486116coroutine_handleiveeepns3_14qcorotcpserver29waitfornewconnectionoperationes9_eeli0ens_4listijeeeve4impleipns_15qslotobjectbaseep7qobjectppvpb@
dovecot[3384677]: indexer-worker: Error: terminate called after throwing an instance of 'std::bad_alloc'
dovecot[3384677]: indexer-worker: Error: what(): std::bad_alloc
Mar 08 17:07:06 paluero dovecot[3384677]: imap(user)<3384697><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB>: Error: Mailbox INBOX: indexer failed to index mailbox
Mar 08 17:07:06 paluero dovecot[3384677]: indexer-worker(user)<3384699><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB:wMLDAQeM62V7pTMAquyQZQ>: Fatal: master: service(indexer-worker): child 3384699 killed with signal 6 (core dumped)
I haven't had the chance to investigate what's going on yet, but maybe this is a known issue?
You may have interrupted indexing while indexing on this user. I would delete the entire fts-flatcurve folder in the user's mailbox and start indexing again. I had something similar when I forcibly interrupted indexing.
Duplicate of #44
Somebody needs to provide a testcase to reproduce, since I can't.
This is the whole point of https://slusarz.github.io/dovecot-fts-flatcurve/configuration.html#fts_flatcurve_max_term_size so I'm not sure how this could happen. The size of a term defaults to 30 characters max, and is hardcoded to never be more than 200 characters. So no idea how something larger can be indexed.
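To make the numbers in that comment concrete, here is a minimal sketch of the clamping it describes. The function and constant names are hypothetical and not taken from the flatcurve source; only the values 30 (default), 200 (hard cap) and 245 (the limit quoted in the Xapian error message) come from this thread.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical names for illustration; the values are the ones quoted above.
constexpr std::size_t kDefaultMaxTermSize = 30;   // default term size
constexpr std::size_t kHardMaxTermSize    = 200;  // hardcoded upper bound
constexpr std::size_t kXapianTermLimit    = 245;  // from "Term too long (> 245)"

// If the hard cap is honored, a term can never reach the Xapian limit.
static_assert(kHardMaxTermSize < kXapianTermLimit,
              "hard cap must stay below the Xapian term limit");

// Returns the limit that would actually apply for a configured value.
std::size_t effective_max_term_size(std::size_t configured)
{
    return std::min(configured, kHardMaxTermSize);
}
```

Under that model, a "Term too long (> 245)" warning can only appear if some code path adds a term without going through the clamp.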
Thanks. I tried deleting all fts-flatcurve directories from my ~/Mail dir, and then reissued a search, but I still see the problem.
Right. Here's my 90-fts.conf:
mail_plugins = $mail_plugins fts fts_flatcurve
plugin {
fts = flatcurve
fts_enforced = yes
fts_autoindex = yes
fts_languages = en pt
fts_tokenizers = generic email-address
fts_filters = lowercase normalizer-icu
fts_flatcurve_max_term_size = 30
fts_flatcurve_substring_search = yes
}
As I said above, I can reproduce the problem pretty easily on my mail directory, but unfortunately I don't know if there's another way to do it without having to provide my personal messages :-/.
Does removing 'fts_flatcurve_max_term_size = 30' from your config help?
Unfortunately not. I still see the segmentation fault happening.
This is a partial backtrace:
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x00007fd67c8781cf in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x00007fd67c82a472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007fd67c8144b2 in __GI_abort () at ./stdlib/abort.c:79
#4 0x00007fd67baa0a2d in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#5 0x00007fd67bab1f5a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#6 0x00007fd67baa05d9 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58
#7 0x00007fd67bab21d8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7fd67bc54bc0 <typeinfo for std::bad_alloc>, dest=0x7fd67bab0510 <std::bad_alloc::~bad_alloc()>)
at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:98
#8 0x00007fd67baa0649 in operator new (sz=sz@entry=96) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:54
#9 0x00007fd679882575 in std::__new_allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::allocate (
this=<optimized out>, __n=1) at /usr/include/c++/13/bits/new_allocator.h:151
#10 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > > >::allocate (__n=1, __a=...)
at /usr/include/c++/13/bits/alloc_traits.h:482
#11 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_M_get_node (this=0x56355cbdd140)
at /usr/include/c++/13/bits/stl_tree.h:563
#12 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_M_create_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (this=0x56355cbdd140) at /usr/include/c++/13/bits/stl_tree.h:613
#13 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_Auto_node::_Auto_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (__t=..., this=<synthetic pointer>) at /usr/include/c++/13/bits/stl_tree.h:1637
#14 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_M_emplace_hint_unique<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (this=this@entry=0x56355cbdd140, __pos=__pos@entry={...}) at /usr/include/c++/13/bits/stl_tree.h:2462
#15 0x00007fd679881a46 in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::emplace_hint<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (__pos=..., this=0x56355cbdd140) at /usr/include/c++/13/bits/stl_map.h:638
#16 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::insert<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (__x=..., this=0x56355cbdd140) at /usr/include/c++/13/bits/stl_map.h:860
#17 Xapian::Document::Internal::add_term (this=0x56355cbdd0d0, tname=..., wdfinc=<optimized out>) at ../api/omdocument.cc:502
#18 0x00007fd679881b0b in Xapian::Document::add_term (this=<optimized out>, tname=..., wdfinc=<optimized out>) at ../api/omdocument.cc:146
#19 0x00007fd67c9ee261 in fts_flatcurve_xapian_index_body (ctx=ctx@entry=0x563559ba0458, data=<optimized out>, size=<optimized out>) at fts-backend-flatcurve-xapian.cpp:1308
#20 0x00007fd67c9e84fa in fts_backend_flatcurve_update_build_more (_ctx=0x563559ba0458, data=<optimized out>, size=<optimized out>) at fts-backend-flatcurve.c:328
#21 0x00007fd67c741709 in fts_build_add_tokens_with_filter (ctx=ctx@entry=0x7ffc684e1740,
data=data@entry=0x56355a5e5400 "\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty17setGeometryColumnERK7QString@Base 3.10.2\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty20setPrimaryKeyColumnsERK11QStringList"..., size=size@entry=8138) at /build/reproducible-path/dovecot-2.3.21+dfsg1/src/plugins/fts/fts-build-mail.c:273
#22 0x00007fd67c741958 in fts_build_tokenized (last=false, size=8138,
data=0x56355a5e5400 "\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty17setGeometryColumnERK7QString@Base 3.10.2\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty20setPrimaryKeyColumnsERK11QStringList"..., ctx=0x7ffc684e1740) at /build/reproducible-path/dovecot-2.3.21+dfsg1/src/plugins/fts/fts-build-mail.c:349
#23 fts_build_data (ctx=ctx@entry=0x7ffc684e1740,
This crash indicates that a memory allocation is failing with an out-of-memory error. Have you increased the memory limit for the indexer from the default?
Otherwise, the backtrace is not very useful, as all the function data has been optimized out. You can try compiling again without optimization to see if that provides more info in the stack trace.
Really would like to know if it is a particular message that is causing the crash, or if you are simply hitting memory errors because your messages are so large. (Nothing you can do there except increase memory or reduce shard sizes. In out-of-memory case, a segfault is expected and the correct behavior.)
If by memory limit you mean the fts_flatcurve_commit_limit option, then yes, I'm using a value of 5000.
I've been meaning to recompile and continue debugging the problem further but unfortunately I don't have the time right now. Maybe on the weekend. I'll provide more info when I have it.
Try something lower, like 1000.
If you have large messages (with lots of indexing data), Xapian can use more than 256MB (default vsz_limit) of memory, which will cause out-of-memory issues. A lower number will ensure that less memory is used before the data is swapped to disk, at the expense of additional I/O.
Actually, commit_limit might be an even better setting to lower.
https://slusarz.github.io/dovecot-fts-flatcurve/configuration.html#fts_flatcurve_commit_limit
I've had a similar problem
These are the errors:
Apr 01 23:04:41 indexer-worker([email protected])<31534><8JOINDQdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(Lixeira): Could not write message data: uid=58763; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/cancelaraviso/1853889ed111cdeac6a85a4a58191e0d3f9ed30553562d76a8fd0e4536e9e105641dd4543d3dcf835d04e911bfc64949b5350821ce593f792cde1c2d1b2b1da2/njjjnmqwztmtnjayyy00mjvkltk0otatmte2ytg0mwe3m2u5.html?email=user@domain.com.br
Apr 01 23:05:24 indexer-worker([email protected])<31534><CFzBJWQdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(Enviadas): Could not write message data: uid=4446; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/cancelaraviso/0bec6827d1b4984841266bf56c7ecd6b6f554679213f8826abeedd6ba9e3b6556d088a98515f715b0e899112bd75a4080959fb6b779e89d30337130871fbbe39/yjvlntqzngytmjnjmi00ndi4lwi5ztitogizyzhmntzjmwu0.html?email=user@domainc.com.br
Apr 01 23:05:25 indexer-worker([email protected])<31534><CFzBJWQdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(Enviadas): Could not write message data: uid=4560; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/viewblob/959e8ee84f7e63458bb0cb91167ce58e865b21b14a0c0c03fbdb8d28f3322f7fbcc0ec8716c8a3b8d3083ef7bba2fcd4b8764c42472ecd30100b74223ab15ffa/zdgwzthhy2etmduzmy00ngfhlwjiogqtzwvjzmq0mwezmjyz.html?email=user@domaind.com.br
Apr 01 23:05:59 indexer-worker([email protected])<31534><IDKFN4YdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(INBOX/CONTRATOS): Could not write message data: uid=17; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/viewblob/b5fbfdb70178e658c46cfa532031624305195d3c2780c6033d6c65f8a16f960557918aa5e9f679deb27c0b531df0107b6d6c59f5f86405e6b397efe9d0481dec/ntg0ngexnmitmje4yi00zguxlthintqtztdlodnlzta0mwjm.html?email=user@domainf.com.br
Here is my 90-fts.conf
plugin {
fts = flatcurve
fts_autoindex = yes
fts_enforced = yes
fts_languages = pt en es
fts_tokenizer_generic = algorithm=simple
fts_tokenizers = generic email-address
fts_filters = normalizer-icu lowercase stopwords
fts_filters_en = lowercase snowball english-possessive stopwords
fts_flatcurve_commit_limit = 50
fts_flatcurve_max_term_size = 30
fts_flatcurve_min_term_size = 2
fts_flatcurve_substring_search = no
fts_index_timeout = 60s
fts_header_excludes = *
fts_header_includes = Date From To Cc Bcc Subject Content-Type
fts_autoindex_max_recent_msgs = 100
}
I think that Xapian::Utf8Iterator::raw() doesn't work as expected. The following is with Xapian 1.4.22 on Debian 12.
I have fts_flatcurve_max_term_size set to 20:
doveconf | grep fts_flatcurve_max_term_size
fts_flatcurve_max_term_size = 20
Test mail (no headers, just a body):
gesangsvereinsbuchausleihgesellschaft
1234567890123456789012345678901234567890
I added some debug logging in fts_flatcurve_xapian_index_body():
diff --git a/src/fts-backend-flatcurve-xapian.cpp b/src/fts-backend-flatcurve-xapian.cpp
index 77d8aaa..bb62041 100644
--- a/src/fts-backend-flatcurve-xapian.cpp
+++ b/src/fts-backend-flatcurve-xapian.cpp
@@ -1299,6 +1299,9 @@ fts_flatcurve_xapian_index_body(struct flatcurve_fts_backend_update_context *ctx
do {
std::string t (ustr.raw());
+ i_debug("data=%s size=%lu t=%s ustr.raw()=%s ustr.left()=%lu",
+ (const char *)data, (unsigned long)size, t.c_str(), ustr.raw(), (unsigned long)ustr.left());
+
/* Capital ASCII letters at the beginning of a Xapian term are
* treated as a "term prefix". Check for a leading ASCII
* capital, and lowercase if necessary, to ensure the term
Delivering the mail:
doveadm save -u user < mail
Dumping the index:
doveadm fts-flatcurve dump -u user INBOX
123456789012345678901234567890 count=1
gesangsvereinsbuchausleihgesel count=1
It saved 30 characters for both words, not 20.
Debug logging:
Debug: data=gesangsvereinsbuchausleihgesel size=20 t=gesangsvereinsbuchausleihgesel ustr.raw()=gesangsvereinsbuchausleihgesel ustr.left()=20
Debug: data=123456789012345678901234567890 size=20 t=123456789012345678901234567890 ustr.raw()=123456789012345678901234567890 ustr.left()=20
So Dovecot core sent 30 characters (data).
ustr = Xapian::Utf8Iterator((const char *)data, size);
This limits the iterator to 20 characters (ustr.left()), but not the raw data of the iterator (ustr.raw()).
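The debug output is consistent with how std::string treats a bare const char *: it copies up to the first NUL byte and knows nothing about a length tracked elsewhere. Here is a small sketch of that difference without any Xapian dependency. ByteView and both helper functions are hypothetical stand-ins for illustration, not Xapian or flatcurve types.

```cpp
#include <cstddef>
#include <string>

// A pointer/length pair, standing in for what a length-limited iterator
// exposes (hypothetical struct, not a Xapian type).
struct ByteView {
    const char *ptr;   // start of the remaining bytes
    std::size_t left;  // number of bytes that belong to the view
};

// Ignores the length: copies until the first NUL byte.
std::string copy_unbounded(const ByteView &v)
{
    return std::string(v.ptr);
}

// Honors the length: copies exactly v.left bytes.
std::string copy_bounded(const ByteView &v)
{
    return std::string(v.ptr, v.left);
}
```

With the 30-byte token from the test mail and a view length of 20, copy_unbounded() yields all 30 bytes while copy_bounded() yields the first 20, which matches the 30-character terms seen in the dump.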
I don't know where the limit of 30 characters from Dovecot core comes from. Maybe there is a scenario where Dovecot core sends more than 30 characters, more than 245 even?
Thank you for debug help @edieterich ... you are correct that Utf8Iterator usage does not appear to be correct and that should be looked at.
...but with that being said, this is still irrelevant for purposes of debugging this ticket. As you noted, it's not just flatcurve that does max term limitation. It's also tokenization. By default, generic tokenizer limits to 30-byte long tokens (see maxlen):
https://doc.dovecot.org/settings/plugin/fts-plugin/#plugin_setting-fts-fts_tokenizers
So by the time flatcurve processes the string, there have been 2 limitation gates that prevent it from being more than 30 bytes long.
If I set fts_tokenizer_generic = algorithm=simple maxlen=500, I can trigger a too long error:
May 03 17:41:00 indexer-worker(user)<357><GKgOOZAXSqR/AAAB:uDz6EKwhNWZlAQAAxGKukQ>: Warning: fts-flatcurve(INBOX): Could not write message data: uid=4; InvalidArgumentError: Term too long (> 245): aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
But this is expected behavior, and no crash/segfault. (This is on the test Debian image, which is currently bullseye-slim).
So I see two things that can be improved, but don't have anything to do with segfaults:
- Remove fts_flatcurve_max_term_size as it is just duplicative of the tokenizer setting
- Fix Utf8Iterator usage. We don't need it for max determination anymore - we just need it to get the minimum number of characters.
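For the second point, a minimum-length check only needs a character count, which can be done without an iterator class. This is a rough sketch, assuming the term is already valid UTF-8; both function names are made up for illustration and this is not the code that was committed.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 characters by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_char_count(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// True if the term has at least min_chars characters (not bytes).
bool meets_min_term_size(const std::string &term, std::size_t min_chars)
{
    return utf8_char_count(term) >= min_chars;
}
```

Counting characters rather than bytes matters here because a single two-byte character would otherwise pass a minimum of 2.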
Only thing I can think of to additionally investigate is that maybe Dovecot fts tokenizer is potentially splitting UTF-8 character when it enforces max length, and maybe that causes problems? But that would be a core issue, not a flatcurve one.
First, answering my own question, but the generic tokenizer IS UTF-8 aware and will correctly handle a split UTF-8 character at the split point.
Also, it turns out the Dovecot team already removed use of Utf8Iterator in the new 2.4 code similarly to how I changed things here.
Anyway, maybe somebody can test this new code. Since I can't reproduce the original issue, I have no idea whether this helps or not.
This code was committed, and a new release was pushed almost a month ago. Haven't heard any response in this ticket, so the assumption is that these changes fixed the issue(s). Closing ticket.
I just tried compiling the master branch, and unfortunately I'm seeing errors when indexing my INBOX:
May 29 20:44:43 dovecot[2965329]: indexer-worker: Error: terminate called after throwing an instance of 'std::length_error'
May 29 20:44:43 dovecot[2965329]: indexer-worker: Error: what(): basic_string::_M_create
May 29 20:44:43 dovecot[2965329]: imap(user)<2965343><17URJqEZ3IoAAAAAAAAAAAAAAAAAAAAB>: Error: Mailbox INBOX: indexer failed to index mailbox
May 29 20:44:43 dovecot[2965329]: indexer-worker(user)<2965345><17URJqEZ3IoAAAAAAAAAAAAAAAAAAAAB:fI6PGJPLV2ZhPy0AquyQZQ>: Fatal: master: service(indexer-worker): child 2965345 killed with signal 6 (core dumped)
The message is different than the one I was seeing before, but the outcome is still a segfault.
The generic tokenizer doesn't always respect maxlen and UTF-8 character boundaries. Take the attached crash_data.txt file. I'm not sure what this is, but it's part of a mail that crashed Flatcurve with the same error as sergiodj's.
maxlen is the default, 30 characters:
doveadm fts tokenize 1234567890123456789012345678901234567890
123456789012345678901234567890
With crash_data.txt I get a token longer than 30 characters:
doveadm fts tokenize "$(cat crash_data.txt)"
������6���
�����
����������
������6�����������������������������?�����@������������������������������
To crash Flatcurve, configure substring search, otherwise it doesn't crash:
plugin {
fts_flatcurve_substring_search = yes
}
Now save crash_data.txt as a header or a body to crash Flatcurve in fts_flatcurve_xapian_index_header or fts_flatcurve_xapian_index_body:
echo "Subject: $(cat crash_data.txt)" | sudo doveadm save -u user
echo -e "\r\n$(cat crash_data.txt)" | sudo doveadm save -u user
I added some debug logging in fts_flatcurve_xapian_index_body just before and after the size -= csize; statement:
...
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size before: 200
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size after: 197
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header csize: 3
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size before: 197
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size after: 194
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header csize: 3
...
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header csize: 3
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size before: 2
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size after: 4294967295
Jun 19 16:45:17 indexer-worker: Error: terminate called after throwing an instance of 'std::length_error'
Jun 19 16:45:17 indexer-worker: Error: what(): basic_string::_M_create
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Fatal: master: service(indexer-worker): child 345624 killed with signal 6 (core dumped)
So I get a 200-character token (instead of the expected maxlen of 30) whose last 3-byte UTF-8 character is cut off after 2 bytes. That leads to an unsigned integer underflow, which in turn leads to the crash.
I don't think there's much you can do except for preventing the unsigned integer underflow. The real problem is in the tokenizer.
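The wraparound in the log (size 2, csize 3, result 4294967295) is plain unsigned arithmetic. Here is a minimal sketch of the failure and of a guard against it. The function names are hypothetical, and a 32-bit type is used only because it reproduces the logged value; the real variables may have a different width.

```cpp
#include <cstddef>
#include <cstdint>

// Unguarded: unsigned subtraction wraps around when csize > size.
std::uint32_t advance_unguarded(std::uint32_t size, std::uint32_t csize)
{
    return size - csize;
}

// Guarded: never consume more bytes than remain in the buffer.
std::size_t advance_guarded(std::size_t size, std::size_t csize)
{
    return csize > size ? 0 : size - csize;
}
```

With the guard, a lead byte that claims 3 bytes when only 2 remain ends the loop instead of producing a huge remaining size that later feeds a string constructor.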
Splitting the last multibyte UTF-8 character is caused by fts_backend_flatcurve_update_build_more:
size = I_MIN(size, FTS_FLATCURVE_MAX_TERM_SIZE);
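A byte-count clamp like this can land in the middle of a multibyte character. One way to avoid that is to back up to the previous character boundary after clamping. The sketch below shows the idea under the assumption that the input is valid UTF-8; truncate_utf8 is a hypothetical helper, not the change that was eventually committed.

```cpp
#include <cstddef>

// True for UTF-8 continuation bytes (10xxxxxx).
inline bool is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Returns the largest length <= max_bytes that does not end inside a
// multibyte sequence. Assumes data[0..size) is valid UTF-8.
std::size_t truncate_utf8(const unsigned char *data, std::size_t size,
                          std::size_t max_bytes)
{
    if (size <= max_bytes)
        return size;
    std::size_t n = max_bytes;
    // data[n] is the first byte that would be dropped. If it is a
    // continuation byte, the cut is mid-character: back up to the lead
    // byte and drop that whole character.
    while (n > 0 && is_utf8_continuation(data[n]))
        --n;
    return n;
}
```

The result can be shorter than max_bytes by up to 3 bytes, but every character that remains is complete.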
Update: I can confirm that this is a bug in Dovecot core code (specifically the FTS tokenization code).
For the generic tokenizer, it doesn't happen for ALL large strings - it seems to require certain input that confuses the UTF-8 parsing code in core. It will correctly tokenize the input to a certain point, but then the last token is the entire string instead of the remaining text. So somehow the string pointer is being reset.
As #67 shows, this can also be triggered by long email addresses, since that tokenizer doesn't have a size limit.
Since we are getting close to releasing 2.4.0, I want to fix these issues in actual Dovecot core code first, so that the flatcurve code now living in core doesn't have to be changed. I will look at @edieterich #65 as a workaround for the 2.3 branch.
It turns out that core is behaving as expected.
The problem here is actually the "email-address" tokenizer, not the "generic" tokenizer.
The email tokenizer, by default, has a maximum length of 254 characters. So this is where the excessively sized tokens are being generated.
It turns out that the email tokenizer also supports the "maxlen" parameter, even though this is not documented (I will fix this for 2.4 documentation). Thus, a workaround would be to set a max length for email-addresses, such as:
fts_tokenizer_email_address = maxlen=60
Indexing 200+ character email addresses is probably not all that useful, so this should probably be the recommended configuration anyway.
I will likely adapt #65, although I would like to emit a debug message when the token is truncated, for debugging purposes.
Now confirmed fixed, with CI test added.