Add support for the FTS5 trigram tokenizer
I had added support for the newish FTS5 trigram tokenizer in my own project at some point. Upstreaming it here after some polish.
-
I kept the documentation basic, the trigram tokenizer is a bit different than the regular ones and has some caveats but probably best to leave it to the SQLite documentation to lists the details?
-
The tokenizer (currently) has two boolean arguments but only 3 combinations are valid so I implemented the options as a single enum in Swift. I hope that is OK? Feedback welcome, especially on the enum and case naming.
-
I tried to add
GRDBCUSTOMSQLITE,GRDBCIPHERconditionals and availability annotations as needed. I would welcome if you had a look if they look sensible. -
One of the tokenizer options (
remove_diacritics, e.g.TrigramTokenizerMatching.caseInsensitiveRemovingDiacritics) is available from SQLite3.45(according to an SQLite forum post, it's not mentioned in the release notes). That is newer than what Apple currently ships so this option can't currently be used with the system provided SQLite. I was wondering if you have a preferred way of handling that. For now I added an@available(*, unavailable)annotation with a message indicating that a future OS release is required. Alternatively the stdlib approach with an@available(macOS 9999, ...)annotation could be used but I feel a custom message is maybe a bit more self-explanatory. Finally we could just comment that option out (for the non-custom build) until Apple ships a release with SQLite>3.45. -
I made a minor adjustment in
FTS5Tokenizer.swiftto create a null-terminated C string when invoking a tokenizer from GRDB to work around a bug with the trigram tokenizer where it doesn't always handle non-null terminated strings correctly. The string contents were also copied before so this shouldn't perform worse.
Pull Request Checklist
- [x] CONTRIBUTING: You have read https://github.com/groue/GRDB.swift/blob/master/CONTRIBUTING.md
- [x] BRANCH: This pull request is submitted against the
developmentbranch. - [x] DOCUMENTATION: Inline documentation has been updated.
- [x] DOCUMENTATION: README.md or another dedicated guide has been updated.
- [x] TESTS: Changes are tested.
- [x] TESTS: The
make smokeTestterminal command runs without failure.
Hello @Jnosh, thank you very much for this pull request. I intended to review this week-end, but I could not. Please stay tuned.
Did some minor testing with this since it aligned with testing my own stuff.
I defined an FTS5 virtual table with trigram tokenizer using new code here, loaded 105,000 word manuscript, each paragraph its own row, and ran some MATCHes. That leaves plenty untested but it worked perfectly for me.
I had added support for the newish FTS5 trigram tokenizer in my own project at some point. Upstreaming it here after some polish.
🥳 Thank you @Jnosh!
- I kept the documentation basic, the trigram tokenizer is a bit different than the regular ones and has some caveats but probably best to leave it to the SQLite documentation to lists the details?
👍 I like it. In my review I just added a link where some details were missing.
- The tokenizer (currently) has two boolean arguments but only 3 combinations are valid so I implemented the options as a single enum in Swift. I hope that is OK? Feedback welcome, especially on the enum and case naming.
Yes, I have a feedback ;-) See the review below.
- I tried to add
GRDBCUSTOMSQLITE,GRDBCIPHERconditionals and availability annotations as needed. I would welcome if you had a look if they look sensible.
#if GRDBCUSTOMSQLITE || GRDBCIPHER is perfect. It translates into a runtime error if trigram is missing, that people will quickly catch it at runtime. On Apple platforms, we can use availability checks 👍.
- One of the tokenizer options (
remove_diacritics, e.g.TrigramTokenizerMatching.caseInsensitiveRemovingDiacritics) is available from SQLite3.45(according to an SQLite forum post, it's not mentioned in the release notes). That is newer than what Apple currently ships so this option can't currently be used with the system provided SQLite. I was wondering if you have a preferred way of handling that. For now I added an@available(*, unavailable)annotation with a message indicating that a future OS release is required.
OK. @available(*, unavailable, message: "Requires a future OS release that includes SQLite >=3.45") is perfect. See my review comments for more details.
- I made a minor adjustment in
FTS5Tokenizer.swiftto create a null-terminated C string when invoking a tokenizer from GRDB to work around a bug with the trigram tokenizer where it doesn't always handle non-null terminated strings correctly. The string contents were also copied before so this shouldn't perform worse.
Thanks for modernizing this old code 👍
@Jnosh I'm very slow to reply because I'm quite busy these days. Please be assured I did not forget this PR.
No worries!
Hi @Jnosh, and happy new year!
Sorry for blocking the PR for so long. And thank you for being patient, and explore another API. In the end, your initial suggestion (before a303308f) is just more straightforward, and arguably superior. I will happily merge the PR after a303308f is removed, if you're OK as well.