stork
Main thread panic in stork search "not a char boundary"
With the following input file (named gallery.toml):
[output]
displayed_results_count = 31
[input]
url_prefix = "/gallery/"
frontmatter_handling = "Omit"
stemming = "None"
minimum_indexed_substring_length = 4
files = [
{ url = "010695", title = "(Untitled)", contents = "caption saint–aubin–fosse–louvain then gules a chevron argent between 3 eagles or. caption saint–berthevin then (1999) per chief per pale gules and argent; and sable an eagle arg beaked and membered or in dexter side taller a lion sable crowned gu armed and langued gu in sinister side shorter shorter a demi lion arg in chief taller taller lower. caption saint–berthevin–la–tanniere then (1999) or a chevron gules 2 eagles in chief azure a tree eradicated vert in middle %base higher. caption saint–charles–la–foret then gules a carbuncle or. caption 'saint–denis–d'anjou' ", filetype="PlainText" },
]
We run the command:
stork build --input gallery.toml --output gallery.st
We then try a command-line search for a known hit, e.g.:
stork search --format json --index gallery.st --query "azure"
We get the message:
thread 'main' panicked at 'byte index 540 is not a char boundary; it is inside '–' (bytes 539..542) of `caption saint–aubin–fosse–louvain then gules a chevron argent between 3 eagles or. caption saint–berthevin then (1999) per chief per pale gules and argent; and sable an eagle arg beaked and membered or in dexter side taller a lion sable crow`[...]', stork-lib/src/index_v4/search/excerpt_grouping.rs:158:19
stack backtrace:
0: rust_begin_unwind
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
2: core::str::slice_error_fail_rt
3: core::str::slice_error_fail
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/str/mod.rs:86:9
4: stork_lib::index_v4::search::render_search_values
5: stork::main
(Byte 540 is just before the final "gules" in the content string)
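For readers unfamiliar with this class of panic: Rust strings are UTF-8, and slicing a `&str` at a byte offset that falls inside a multi-byte character panics with exactly this "not a char boundary" message. The '–' named in the message occupies three bytes (539..542), so an offset of 540 lands in the middle of it. The following is a minimal sketch of the failure mode and one common way to avoid it. It is not Stork's actual code; the helper names `floor_boundary` and `safe_prefix` are made up for illustration.

```rust
/// Hypothetical helper: the largest index <= `idx` that lies on a
/// UTF-8 character boundary of `s`.
fn floor_boundary(s: &str, idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    let mut i = idx;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Hypothetical helper: a prefix of at most `max_bytes` bytes that
/// never splits a character.
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    &s[..floor_boundary(s, max_bytes)]
}

fn main() {
    // "saint" is 5 bytes; the en dash (U+2013) occupies bytes 5..8.
    let text = "saint\u{2013}aubin";
    assert_eq!('\u{2013}'.len_utf8(), 3);
    assert!(!text.is_char_boundary(6));

    // `&text[..6]` would panic here, just like the reported crash.
    // Rounding the offset down to a boundary avoids the panic.
    println!("{}", safe_prefix(text, 6));
}
```

Rounding down shortens the excerpt by a byte or two at most; whether rounding down or up is the right choice depends on how the excerpt offsets are computed, which this sketch does not attempt to model.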
Other information:
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ stork --version
Stork 2.0.0-beta.2
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ file /usr/local/bin/stork
/usr/local/bin/stork: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=2b673e767e82fea952b7220d047e5fe187d91b27, for GNU/Linux 3.2.0, with debug_info, not stripped
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ uname -a
Linux DESKTOP-9DUHI21 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
(So this is Ubuntu Linux running under WSL2, although I can also reproduce it on a native Ubuntu installation.)
Please let me know if you need anything else. Hope this is useful!
Further investigation suggests that the characters that look like '-' are not ASCII. Removing them solves the problem, so this is likely related to character encoding.
The character that seems to cause the problem is \u2013 (en dash).
In fact, any non-ASCII character in the input file seems to cause a problem with the command-line search hits. In PHP, pre-processing the content with
iconv("UTF-8", "ASCII//TRANSLIT", $content);
fixes the problem.
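For anyone who needs the same workaround outside PHP, here is a rough sketch of a comparable pre-processing step in Rust. It is an assumption-laden illustration, not a port of iconv: the function name `to_ascii_lossy` and the small mapping table are invented here, and anything not in the table becomes '?' rather than being transliterated.

```rust
/// Hypothetical pre-processing step: fold a string down to ASCII
/// before writing it into the `contents` field of the config file.
/// The mapping table is illustrative and far from complete.
fn to_ascii_lossy(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // ASCII passes through unchanged.
            _ if c.is_ascii() => out.push(c),
            // Hyphens and dashes U+2010..U+2015, including the en dash U+2013.
            '\u{2010}'..='\u{2015}' => out.push('-'),
            // Curly single and double quotes.
            '\u{2018}' | '\u{2019}' => out.push('\''),
            '\u{201C}' | '\u{201D}' => out.push('"'),
            // Everything else is replaced, not transliterated.
            _ => out.push('?'),
        }
    }
    out
}

fn main() {
    let cleaned = to_ascii_lossy("caption saint\u{2013}aubin\u{2013}fosse\u{2013}louvain");
    assert!(cleaned.is_ascii());
    println!("{}", cleaned);
}
```

Like the iconv call, this only sidesteps the crash by removing the characters that trigger it; it also discards information (accented letters, for example), so it is a stopgap rather than a fix.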
This may even be documented somewhere so I'll shut up now...