
Rust: spec how string keys are interpreted

Open brson opened this issue 6 years ago • 3 comments

In particular, keys could be canonicalized in some way before doing any comparisons, and key ranges could be specced in different ways.

I basically think that there should be no canonicalization, and ordering should be defined however std's default ordering is, and hopefully that is by code points.

Some of my rambling from slack:

There is _also_ the question of what the proper string ordering even _is_. We'll need to define that explicitly.

This whole project is based around UTF-8 strings, and there are many ways to encode the same Unicode string and different ways to order them.

We haven't specified whether the DB does any canonicalization of string keys or whether they are just treated internally as a byte vec
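To make the canonicalization question concrete, here is a minimal sketch of what "no canonicalization" implies. It assumes keys are compared purely by their UTF-8 bytes; `same_key` is a hypothetical helper, not part of any project API. The two constants are two valid Unicode spellings of "é".

```rust
// Two spellings of "é": precomposed U+00E9, and 'e' followed by
// the combining acute accent U+0301.
const PRECOMPOSED: &str = "\u{00E9}";
const DECOMPOSED: &str = "e\u{0301}";

/// Hypothetical helper: two keys are "the same key" only if their
/// UTF-8 bytes are identical (that is, no canonicalization is applied).
fn same_key(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

fn main() {
    println!("{:?}", PRECOMPOSED.as_bytes()); // [195, 169]
    println!("{:?}", DECOMPOSED.as_bytes()); // [101, 204, 129]
    // Without normalization these are two distinct keys.
    println!("same key? {}", same_key(PRECOMPOSED, DECOMPOSED)); // false
}
```

Under this assumption a store would hold both spellings as separate entries, which is the behavior a "keys are just byte vecs" spec would be signing up for.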

I think for now we can spec it all as simply treating keys as byte vecs, but a later project that gets into unicode would be very practical

actually, a scan operation that treats strings as byte vecs sounds pretty bogus to me, though maybe it "just works" in some sensible way in utf-8. I think the simple result we want is that keys are ordered by code point.
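As a rough illustration of what a key-range scan could look like, the sketch below assumes the in-memory index is a `BTreeMap<String, String>` and that a range is half-open, `[start, end)`. Both are assumptions for illustration, and `scan` and `sample_index` are hypothetical names.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Hypothetical index with a few ASCII and non-ASCII keys.
fn sample_index() -> BTreeMap<String, String> {
    let mut index = BTreeMap::new();
    for key in ["z", "a", "\u{4e2d}", "ab", "\u{e9}"] {
        index.insert(key.to_string(), String::new());
    }
    index
}

/// Hypothetical scan: keys in `[start, end)`, in the map's own order.
/// `BTreeMap::range` panics if `start > end`, so callers must check that.
fn scan(index: &BTreeMap<String, String>, start: &str, end: &str) -> Vec<String> {
    index
        .range::<str, _>((Bound::Included(start), Bound::Excluded(end)))
        .map(|(key, _)| key.clone())
        .collect()
}

fn main() {
    let index = sample_index();
    println!("{:?}", scan(&index, "a", "z")); // ["a", "ab"]
    println!("{:?}", scan(&index, "b", "\u{4e2d}")); // ["z", "é"]
}
```

The scan needs no Unicode-aware logic: the map orders keys with `String`'s `Ord`, and the non-ASCII keys still come out in code point order (U+007A, then U+00E9, then U+4E2D).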

Though in practice, since the entire index is in memory, the ordering we are going to get is whatever std's default string ordering is. Now I'm curious how that comparison is implemented.
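One way to probe how std compares strings is to check it against two candidate definitions. As far as I know, the std docs describe `str` ordering as lexicographic by byte value, and UTF-8 is designed so that byte order matches code point order. The sketch below compares the three on a small hand-picked sample; the helper names are hypothetical, and a passing sample is evidence, not a proof.

```rust
use std::cmp::Ordering;

/// Whatever std's `Ord for str` does.
fn cmp_std(a: &str, b: &str) -> Ordering {
    a.cmp(b)
}

/// Lexicographic comparison of the raw UTF-8 bytes.
fn cmp_bytes(a: &str, b: &str) -> Ordering {
    a.as_bytes().cmp(b.as_bytes())
}

/// Lexicographic comparison of the decoded code points.
fn cmp_code_points(a: &str, b: &str) -> Ordering {
    a.chars().cmp(b.chars())
}

/// True if all three orderings agree on every pair in `keys`.
fn all_agree(keys: &[&str]) -> bool {
    keys.iter().all(|a| {
        keys.iter().all(|b| {
            let std_order = cmp_std(a, b);
            std_order == cmp_bytes(a, b) && std_order == cmp_code_points(a, b)
        })
    })
}

fn main() {
    // Sample covers 1-, 2-, 3- and 4-byte UTF-8 sequences plus prefixes.
    let keys = ["", "a", "ab", "b", "z", "\u{e9}", "\u{4e2d}", "\u{ffff}", "\u{1f600}"];
    println!("all orderings agree: {}", all_agree(&keys));
}
```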

brson avatar Jun 04 '19 20:06 brson

I'm putting this on the mvp, but it's probably ok to slip.

brson avatar Jun 04 '19 20:06 brson

cc @mapleFU

brson avatar Jun 05 '19 02:06 brson

Here's a string sorting experiment: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=99ec4fd4c0e544282aa7094b718f3ddb

They seem to be sorted by code point, then string length.
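For readers who cannot open the playground link, here is a small sketch in the same spirit. It is not a copy of the linked gist, and the sample keys and the `sorted` helper are made up for illustration. It suggests that "then string length" only applies when one key is a prefix of another, which is ordinary lexicographic order.

```rust
/// Sort keys with std's default `Ord for &str`.
fn sorted(mut keys: Vec<&str>) -> Vec<&str> {
    keys.sort();
    keys
}

fn main() {
    let keys = vec!["b", "ab", "a", "abc", "B", "\u{e9}"];
    // Uppercase 'B' (U+0042) sorts before lowercase 'a' (U+0061).
    // "a" < "ab" < "abc": a prefix sorts before its extensions.
    // "abc" < "b": the first differing code point decides, not the length.
    println!("{:?}", sorted(keys)); // ["B", "a", "ab", "abc", "b", "é"]
}
```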

brson avatar Jun 05 '19 03:06 brson