sqlite3 icon indicating copy to clipboard operation
sqlite3 copied to clipboard

ICU (Unicode) support?

Open mindplay-dk opened this issue 1 year ago β€’ 7 comments

Is there any particular reason you ship a binary without Unicode (ICU) support?

You ship with the full-text search extension, which makes sense - but how is full-text search any use without Unicode support? Literally all the text on the internet is in Unicode format.

I can understand shipping SQLite without Unicode support for things like internal databases in apps, or the system registry in your computer or phone - but (I would assume) a primary reason people are interested in SQLite in Deno, is to build microservices, or smaller/simpler web apps, and I can't imagine what those services or apps would even be doing that doesn't require Unicode support?

Maybe this is a cultural divide of some sort? But I swear, I have researched this topic everywhere, and the only explanation I can find is, the ICU extension is "big" and makes some things "slow" - and I can understand this position from SQLite itself, being designed and optimized for use as an embedded database.

But shipping this as a Deno package is very different from the SQLite default use-case, I think? Are we more concerned about being "small" and "fast", or being actually useful?

I understand it's possible to build the extension from source and load it myself, but I'm just a humble web developer, and Unicode support seems very much like something basically everyone would need?

I can't understand how people are building anything with SQLite without Unicode?

Or are they all going out on their own to solve this problem first? am I just dumb or lazy? πŸ˜…

mindplay-dk avatar Dec 26 '24 13:12 mindplay-dk

To the best of my understanding, SQLite should at least support storing Unicode UTF-8 strings?

So I attempted to run a simple query SELECT upper('abc æøΓ₯') just to see what happens, and:

Screenshot 2024-12-26 153559

I also tried running PRAGMA encoding='UTF-8' first, same result.

I then swapped to npm:libsql, which runs the query, but of course return ABC æøΓ₯, since there's no ICU support there either.

I found this article, which sums up 5 ways to handle Unicode, all of them slow or impractical in various ways. So is custom collations and bad performance what people put up with in JS? Or does everyone build their own binaries from source?

mindplay-dk avatar Dec 26 '24 15:12 mindplay-dk

I was able to reproduce your issue with a straight sql string - db.prepare("select * from table where somerow = 'πŸ¦„'").run();

However, if it is parameterized, there is no problem. db.prepare("select * from table where somerow = ?").run(["πŸ¦„"]);

So this means it is practically not a big deal.

sqlite's CLI does allow unicode literals, though, so I suppose it should be supported.

michaelmanus avatar Dec 27 '24 01:12 michaelmanus

Can confirm, it works with parameter binding. πŸ‘

sqlite's CLI does allow unicode literals, though, so I suppose it should be supported.

I think most clients allow it? I guess queries containing unicode literals might be rare - I just did it for testing. πŸ™‚

I guess I've kind of mixed two different issues into one though.

Really the main thing that baffles me is why no client ships with something as basic and essential as Unicode support?

I know ICU is "big" and makes things "slower". I also understand there are some issues with OS kernel differences, leading to practical subtleties around DB file portability. Yet, the internet runs on Unicode - everything right down to something as trivial as JSON requires Unicode.

So I just spent another half day on this, and finally came up with a tolerable solution - this relies on a lot of fragile, user-supported tooling though.

Do you think it would make sense to bundle some loadable extensions like nalgeon/text with the package? (I read that the official icu extension is "big", but this one is just ~64KB.)

Alternatively, would it make sense to integrate sqlpkg somehow?

Perhaps providing a function like e.g. getExtension("nalgeon/text") which would download and install sqlpkg locally in the project, run sqlpkg init and sqlpkg install nalgeon/text, and then return a Promise<string> with the resolved path, e.g. .sqlpkg/nalgeon/text/text.so.

Alternatively some maybe some way to literally import from e.g. @sqlpkg/nalgeon/text to get an installed package and path? Obviously that's way outside the scope of this project, I'm just airing the idea in case you find this approach preferable to including a loadable extension.

I'm just wondering how we can save the next person from a day or two of adventuring through a long list of half solutions and bad ideas before discovering how to do something as basic as just searching and sorting by Unicode strings. πŸ˜…

mindplay-dk avatar Dec 27 '24 13:12 mindplay-dk

I don't understand how it is a big deal - if you need to use ICU extension it is possible to load extensions with the library. And as for unicode encoding support in statements (not parameters - which was already there), I added that in latest commit and tests cases supporting it. It seems that it is possible to bake in ICU support to the binary but most SQLite3 clients do not seem to do that, at least by default.

A separate thing for making the extension easier to load would be good.

DjDeveloperr avatar May 27 '25 15:05 DjDeveloperr

I don't understand how it is a big deal

I don't understand how it's not πŸ˜„

literally every computer and the entire internet is unicode - I don't think I've seen an ASCII file in 10 years.

how does it make sense to ship anything without unicode support.

It seems that it is possible to bake in ICU support to the binary but most SQLite3 clients do not seem to do that, at least by default.

because SQLite was designed to be small enough to embed on extremely low spec phones and devices

that's just not the scenario if you're building something with Node

there is no JS runtime that even supports ASCII without workarounds, it's all natively unicode

yet for some reason no one ships an SQLite binary for Node that you can actually just use for anything useful out of the box

this isn't by any means a critique of your or your package, but I have long since given up on SQLite

it's just not worth the hassle 😌

on top of that, I discovered pglite which is literally postgres in a 3 MB binary

if we can have a proper database in 3 MB why would you even consider SQLite for anything that touches the web? πŸ€·β€β™‚οΈ

mindplay-dk avatar May 27 '25 18:05 mindplay-dk

Storing/retrieving unicode is there, it is actually supporting it fully in the query engine that is missing, am I getting it right? And latter can be loaded up as an extension for cases it is really needed. I'm not fully understanding the need but I'd like to learn. Could you provide some example use case that doesn't otherwise work that is commonly used by web apps for instance?

DjDeveloperr avatar May 29 '25 16:05 DjDeveloperr

Mainly case-insensitive search, and locale-aware sorting. I don't know that most apps can do without this.

Lack of unicode normalization also means that e.g. "Γ©" (U+00E9) and "e" + "Β΄" (U+0065 U+0301) can't be compared, so:

SELECT 'é' = 'é';  -- returns 0

The LIKE operator isn't aware and can't search by character classes, so:

SELECT * FROM users WHERE name LIKE '%n%';  -- works, but doesn't find strings with 'Γ±'

Operators like UPPER and LOWER don't really work either for words with accented letters, and so on.

There are practical reasons why every single database product ships with unicode support - and every single digital product in general, for that matter. Literally the entire internet and every file on your computer and most your devices use unicode.

I can't even think of a use case for a database with no unicode support, at least not in a web context.

I'd say the only place SQLite makes sense without a unicode extension is in the context of things like microcontrollers or maybe game engines or other real time, extremely performance intensive or super low resource scenarios.

It's no use in any setting involving things like people's names or text content where any kind of searching or sorting is required - which is practically any application I can think of.

may I ask what you're using it for? πŸ™‚

mindplay-dk avatar May 29 '25 17:05 mindplay-dk