quamina
Proposal for better number representation
Posted by @arnehormann, see https://go.dev/play/p/gCzCrd3X0w1 from https://www.tbray.org/ongoing/When/202x/2024/07/09/Q-Numbers#c1720758803.307064
Should probably copy the code in; I don't know how long Go Playground posts last:
package main

import (
	"cmp"
	"encoding/binary"
	"fmt"
	"math"
	"strconv"
)

func Float64ToSortable(v float64) string {
	const sign uint64 = 1 << 63
	bin := math.Float64bits(v)
	if bin < sign {
		// positive values gain high-bit
		bin |= sign
	} else if bin != sign {
		// negative values have to be inverted to be ordered.
		// special case for -0: it keeps the same binary representation
		// and becomes 0
		bin = ^bin
	}
	var dst [8]byte
	binary.BigEndian.PutUint64(dst[:], bin)
	return string(dst[:])
}

func main() {
	vals := []string{"-1.0e120", "-1.0", "-1e16", "1e16", "10000000000000000", "-1E16", "0", "-0.0", "1.0e16", "-0.1e-16", "0.1e-16"}
	tests := make([]float64, len(vals))
	sortable := make([]string, len(vals))
	for i, v := range vals {
		f, err := strconv.ParseFloat(v, 64)
		if err != nil {
			panic(fmt.Errorf("error test %d. with %q: %v", i, v, err))
		}
		tests[i] = f
		sortable[i] = Float64ToSortable(f)
	}
	for i, ti := range tests {
		for j, tj := range tests {
			si, sj := sortable[i], sortable[j]
			if cmp.Compare(ti, tj) != cmp.Compare(si, sj) {
				fmt.Printf("Error: %q (%d) and %q (%d):\n\t%x is %v\n\t%x is %v\n", vals[i], i, vals[j], j, si, ti, sj, tj)
			}
		}
	}
	fmt.Println("no errors? All is good")
}
Hi @timbray, I'll gladly try. I might need a bit of help to find the best place in the existing code. And concerning the Go Playground: I've never seen anything disappear from there. If I don't manage to do it today, I might take about two weeks. If you lack the patience, you can take over.
I'm not a Quamina user (yet; it looks great! I'm interested in your protobuf support issue and can probably help a bit there, too). But your blog post sparked my intellectual curiosity. Thanks for your posts!
Edit - if needed, you can also reach me at gmail with this login.
I guess we never thought of that sort of approach because our automaton runs entirely on UTF-8 bytes, and of course the bytes of your comparable are not UTF-8. Hmm... now I see a problem. We reserve a couple of byte values that can never appear in UTF-8 for a special meaning. The central core of Quamina's automaton is in small_table.go; have a look at that.
smallTable is sufficiently non-obvious that it got a blog piece: https://www.tbray.org/ongoing/When/202x/2022/06/25/Small-Tables
Yeah. It's a string, but not UTF-8. And not NaN-friendly: NaN might be smaller than negative infinity or larger than positive infinity, depending on the binary representation. Which doesn't matter for JSON, though. Also, -0 gets autoconverted to 0 and the information is lost. Otherwise, it's binary-reversible back to the original float64, and it coincidentally has the wonderful property that the zero value will never be created, so an all-zero byte string means unknown or undefined.
For the state, there is some space. It might require reshuffling and will slow the approach down, but if we can map the NaN values into the machine states and normalize NaN to a single representation, this might still work. I don't yet know how well smallTable can be adapted... this is my first dive into the code.
I think NaN and infinities aren't a problem, because all the numbers are read from JSON texts, which doesn't allow those values.
I'm just now running out of time - dang. Offline is calling... I will try to work on this, but if you want to or if you have a great idea on how to integrate it, feel free. Finding my way through what's already here and finding the time to work on it will take me a while. But if you sketch something up and if you are interested in my feedback, I will find the time to provide it. From my phone, probably.
No big hurry. The performance on numbers is, roughly, "good enough". But in an FA, the elapsed time is strongly related to the length of the representation, so this is worth doing. My intuition is that your idea will probably work, and the occurrence of our reserved byte values shouldn't get in the way. Also the unit-test suite is pretty strong, so we'll probably discover any problems quickly.
I have a couple of other things at the front of my queue so I'll leave this issue open for now in hopes that you get time to look into it.
(Yes, I know you're away) I was thinking about this today and I think it should probably work. Yes, we have the two distinguished values byteCeiling 0xF6 and valueTerminator 0xF5, neither of which values can appear in UTF-8. Conventionally, when we make a byte-level dispatch table (see that smallTable blog) we auto-insert byteCeiling as a sentinel value. Since your comparable numbers can have bytes with values higher than 0xF5, when constructing tables, you might have to pick a better sentinel value, up to 0xFF I guess. Which leaves the problem of how to deal with byte 0xFF itself, which can appear in your numbers. I think a tiny piece of special-case code in smallTable can probably handle this.
As for 0xF5, it's a sort of hack. For brevity, let's use "õ" (Unicode U+00F5) to represent it. When a Pattern looks like "foo", we actually store an automaton that matches "foo"õ (yes, with the quotes). Then when we start matching a JSON field value like "bar", we actually append an "õ" so it looks like "bar"õ. So matching code doesn't have to think about whether the match is at end of input on a pattern like *foo. So in practice you'd have to append a õ to the end of your comparable number.
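For concreteness, a minimal sketch of that step, assuming the Float64ToSortable function from the playground code above (the helper name sortableWithTerminator is mine, not Quamina's):

func sortableWithTerminator(f float64) []byte {
	// 0xF5 is the valueTerminator described above; appending it to the
	// comparable number mirrors what happens to string values like "bar"õ
	const valueTerminator = 0xF5
	return append([]byte(Float64ToSortable(f)), valueTerminator)
}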
A bit tricky, but it would make numeric equality matching run more or less twice as fast.
Now, there's the issue of numeric range matching. Get yourself a big cup of coffee and look at https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ByteMachine.java#L1195 - note that the big comment lies, wherever it says '9' it means 'F'. It seems clear we should be able to do these with your comparable numbers, but thinking about it makes my head hurt so that's all for now.
I'll note that a trivial order-preserving way to encode an arbitrary uint64 into a valid UTF-8 string is to use big-endian base 128. Makes the string two bytes longer, but presumably that is not a problem for quamina. [edit] or, reading the thread more closely, perhaps it is? I haven't looked at the architecture of quamina [/edit]
Memory matters, and Quamina is OK for memory until someone adds a million (literally) numeric patterns to an instance, which hasn't happened yet but, based on my experience with its predecessor, probably will happen. Then we'd be very happy to be spending 8 bytes rather than 14 or 16. So @arnehormann's approach is interesting.
But what's also interesting is the CPU cost, because the transformation has to be applied to the incoming numbers in the records being filtered, which is a very hot path. So if we get a choice between faster/smaller… it's not obvious.
@timbray To be clear, when I say "two bytes longer", I mean "than what @arnehormann is doing". The suggestion was to do the same thing, but instead of just interpreting the 8-byte uint64 as a string, to encode it into 10 bytes that are valid UTF-8.
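A minimal sketch of that version, taking the uint64 that Float64ToSortable computes before it is turned into a string (the function name uint64ToBase128 is mine): each output byte carries 7 bits and stays below 0x80, so the result is ASCII, hence valid UTF-8, and lexicographic order still matches uint64 order.

func uint64ToBase128(u uint64) [10]byte {
	var out [10]byte
	for i := 9; i >= 0; i-- {
		out[i] = byte(u & 0x7f) // low 7 bits of what's left, filled right to left
		u >>= 7
	}
	// out[0] ends up holding only the single top bit; every byte is <= 0x7f
	return out
}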
You can go even one step further and encode them in Base 65536 into UTF-8. This is probably close to (if not the) optimal way to encode an arbitrary uint64 into UTF-8. This has the advantage of mapping smaller uint64 to shorter strings.
One step further still: @arnehormann's process maps even relatively small integer numbers into relatively large uint64 (because the fractional part always represents a number in [0,1), anything larger than 1 has a non-zero exponent, which sets high bits in the uint64. AIUI). You can alleviate that effect, by encoding the exponent and fractional part separately into UTF-8. I think you can do significantly better than even that by using some variable-length encoding into Base 65536. I'd have to think a bit about that (and "have to" here probably means my brain won't let me not do it). In any case, I think if you really want to, there are still relatively simple, far more space-efficient encodings that do what you want. You should probably be able to store most practical numbers (say, integers in the range up to ±1000) in one or two bytes.
However, if memory and CPU time really matter at that level, I'd recommend doing away with encoding things into string to begin with. Even @arnehormann's approach uses more memory by a factor of more than 3, compared to leaving numbers as float64 and just comparing those. Because it needs to keep the string header around and puts stuff on the heap, which also needs some extra bookkeeping-space. And it would use float64 comparison instructions, which will be quicker than string-comparisons. But this would require a bunch of other changes to the architecture of quamina.
My suggestion of base 128 above was basically meant as a very easy workaround to essentially do the same thing that @arnehormann does, but solve the UTF-8 compatibility problem in a cheap and easy to understand way, at very little cost.
@timbray this is what happens when one manages to nerd-snipe @Merovius - brilliant ideas, usually coupled with excellent execution. Changing the encoding will definitely free up the control bytes (illegal UTF-8 bytes) you need.
Concerning variable length encoding, I'm skeptical. The whole point is to process everything as a sequence of single bytes. It would usually work, but afaik the automata depend on the byte position to also express the numerical magnitude. Which is why the Big-Endian encoding is used at all.
Still, either Axel finds a way or he'll report back. Also for the encoding. The first byte in UTF-8 is different, after all. An optimal encoding has to consider that. And design-wise, I was thinking of introducing a CompressedNumbits with a possibility to use an adapted version of Normalize which not only converts -0 and NaN but drops them and infinity altogether and squashes the rest into the lower part of the numeric range. Giving you a smaller number to encode, which might in the end free up another byte...
Well, either way: good for you. That Q Numbers blogpost will probably prove to be quite valuable :)
Thought about it on my run just now.
I don't think you can do better than 10 bytes, with a fixed-length encoding, if the result is supposed to be UTF-8. Because UTF-8 has a maximum storage density of 7 bit per byte (multi-byte code points "waste" more bits, to make it self-synchronizing and give it other nice properties) and we need to store 64 bit, we are going to need 9⅐ bytes of UTF-8 on average. A tiny bit less if we ditch NaN, but not a full bit.
The good news is that if we're stuck with 10 bytes/70 bits anyways, we have 6 bits of entropy left to put compression information in, which is a ton (relatively speaking). For example, 6 bits can (by coincidence) store the bit length of the actual payload, implying that it should be possible to have a variable-length encoding that never uses more than 10 bytes (but may use significantly less). In fact, I'm 90% confident I have something suitable. And yes, @arnehormann, I'm aware of the order-restriction :) Coincidentally, this isn't the first time I needed order-preserving variable-length encodings.
The bad news is that I find it complex/interesting enough that I might need/want to write up an explanation before posting it :upside_down_face: Assuming it works.
Okay, I think I have to throw in the towel, this is taking too much time :) Here's the idea, in case it helps dislodge something:
First, let's handle positive numbers. We split the bits into exponent and mantissa. Encode both using base128 as above (i.e. using 7 bits per byte). That will yield 2 bytes for the exponent (11 bits = 7+4) and 8 bytes for the mantissa (52 bits = 7*7+3). So far, this sorts exactly the same, lexicographically, as taking Arne's encoding and stuffing it into base128. Also, from now on, to avoid confusion, "by7e" will mean "7 bit encoded into a byte", i.e. the top-bit is always cleared.
The theory was that a lot of the high bits of the mantissa won't be set. So we can truncate the leading zero by7es, as they don't contain information. But this breaks the lexicographic order: 2 will encode as [2], 128 will encode as [1,0]; the latter is larger, but sorts before the former. To remedy that, we put the total length (minus one) of the encoded mantissa into the top bits of the first by7e. n-1 is at most 7, so this needs three bits. We store n-1, because the first by7e is always there. Now, the first by7e of the mantissa will be greater if the encoding is longer. In the example above, 2 will encode as [0<<5 + 2] = [2], while 128 will encode as [1<<5+1, 0] = [33, 0] and sort higher. As the mantissa has at most 52 bits, we need 3 for the length and have 8*7=56; this gives us an order-preserving encoding that never takes more than 8 by7es.
Now, let's look at negative numbers. Negative numbers need to do two things: 1. the encoding of any negative number must sort lower than the encoding of any non-negative number and 2. the ordering of two negative numbers must be inverted. That's the fundamental trick behind Arne's encoding. We can do the same thing: If the number is negative, we invert every by7e and clear the highest bit of the first encoded by7e. Otherwise, we set the highest bit of the first encoded by7e.
This description is a bit complicated, but I think it's not that hard to understand and theoretically should be pretty efficient to implement. Here is a bad implementation, though. It shows that the encoding works. However, I also discovered that I made a faulty assumption: in practice, the highest bytes of the mantissa usually are not zero. On the contrary: the more "integer" a float is, the higher the mantissa bits that are set (unless it is a power of two, in which case the mantissa is 0). Which means that for >98% of integers in ±1M, the encoding uses the full 10 bytes and doesn't save significant space or time :upside_down_face:
I think that could still be saved, if one wanted to, by inverting the assumption and saying that the low bytes of the mantissa are zero and cutting those off. That means you don't even need the trick with the length: the way lexicographic ordering works is that missing trailing elements are effectively assumed to be 0 anyway. However, I think that breaks with negative numbers, because there the trailing bytes should be interpreted as 0xff, which doesn't jibe with the lexicographic ordering anymore. Perhaps that can be saved by instead storing the number of trailing zeros in the top bits - possibly inverted. I'm not sure. That needs some thinking and I shouldn't.
Also, after these experiments, I'm no longer sure splitting exponent and mantissa is useful. The exponent will, for small numbers, still use two bytes anyway (because of the bias it's stored with), and fixing that is bound to screw with the order. And if you end up cutting trailing bits from the mantissa, it doesn't matter if you do it before or after splitting. So, if the ordering thing with trailing zeros can be figured out, I would probably just try applying that to the entire uint64.
Another random thought: If you encode NaN as the empty string, you lose the bijection, but make them order the same as cmp.Compare (i.e. all NaNs compare equal and lower than all non-NaNs).
In any case, the tl;dr: Probably just use Arne's encoding, but put it into 10 by7es :)
OK, so I think the consensus is that we should try integrating Arne's numbits representation and see how that does. If it works, we should expect two benefits: 1. use less memory (8 bytes vs 14) and 2. run faster, because the automaton has to traverse at most 8 states to match.
Are there any disadvantages? I can't think of any… there's the underlying assumption that the FA construction and traversal code in value_matcher.go and nfa.go and small_table.go is always processing UTF-8 bytes. It wouldn't surprise me if there's something there that I can't think of now that will break if that's not true. Only one way to find out.
Integrating numbits really shouldn't be a lot of work, but I'm going to wait until we land a couple of the now-outstanding PRs.
@timbray I realized that the variable length version of numbits is simpler than I thought. As you have a termination marker with f5, you just have to end when the remaining bytes are all 0.
Note: This relies on shorter sequences that are prefixes of longer ones to always be considered smaller.
Here's a sketch that should be more compact for e.g. integers stored in float64 (not yet tested, though):
import "math"

// encodeFloat turns a float64 into a variable-length, order-preserving byte
// sequence that ends with the 0xf5 termination mark.
func encodeFloat(f float64) []byte {
	u := math.Float64bits(f)
	// same trick as Float64ToSortable: for negative inputs the mask is all ones
	// (invert every bit), otherwise it is just the sign bit (set it)
	mask := (u>>63)*^uint64(0) | (1 << 63)
	num := u ^ mask
	// conversions. NaNs are not handled because they are not valid JSON
	if num == (1<<63)-1 {
		// change -0.0 to 0.0 because it's equal in comparisons
		num = 1 << 63
	}
	// big endian conversion with termination mark included
	const terminationMark = 0xf5
	be128 := [11]byte{
byte(num >> 57) & 0x7f, // 57 bits remaining
byte(num >> 50) & 0x7f, // 50 bits remaining
byte(num >> 43) & 0x7f, // 43 bits remaining
byte(num >> 36) & 0x7f, // 36 bits remaining
byte(num >> 29) & 0x7f, // 29 bits remaining
byte(num >> 22) & 0x7f, // 22 bits remaining
byte(num >> 15) & 0x7f, // 15 bits remaining
byte(num >> 8) & 0x7f, // 8 bits remaining
byte(num >> 1) & 0x7f, // 1 bit remaining
byte(num & 1) << 6, // last bit (high bit of lower 7 bits)
terminationMark,
}
// look for cutoff point for variable length, terminal 0 bytes can be ignored.
for i := len(be128) - 2; i >= 0; i-- {
if be128[i] != 0 {
// insert termination mark after last non-0 byte and return
be128[i+1] = terminationMark
return be128[:i+2]
}
}
// all zero
be128[0] = terminationMark
return be128[:1]
}
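A possible quick check (mine, not from the thread) that the sketch preserves ordering: strip the trailing termination mark and compare the payload bytes, relying on the rule mentioned above that a shorter prefix is considered smaller than any extension of it.

import (
	"bytes"
	"testing"
)

func TestEncodeFloatOrdering(t *testing.T) {
	increasing := []float64{-1e120, -3, -2, -0.5, 0, 0.5, 2, 3, 1e16}
	for i := 1; i < len(increasing); i++ {
		a := encodeFloat(increasing[i-1])
		b := encodeFloat(increasing[i])
		// drop the 0xf5 mark; the comparison that matters is on the payload,
		// where a prefix sorts before any of its extensions
		if bytes.Compare(a[:len(a)-1], b[:len(b)-1]) >= 0 {
			t.Errorf("%v does not sort before %v", increasing[i-1], increasing[i])
		}
	}
}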
Interesting idea. I'm close to posting a PR with fixed-length numbers with numbits+base128.
I'd want to look at a sample of real-life numbers (not that far from 0, relatively few significant digits), and see if there are, on average, a lot of trailing zeroes.
If this works, it will always be the same length or shorter than the regular base128ed numbits. In every case. It will never be longer. Essentially, it is exactly equivalent to numbits in base128 encoding; just additionally truncate all trailing 0 bytes, because they do not influence comparisons. And if I got you right, you need the 0xf5 marker anyway.
I ran the following test to study whether trailing 0 sequences are common in the kinds of numbers that typically show up in Quamina patterns:
func TestTrailingZeroesInBasicNumbers(t *testing.T) {
basics := []float64{-0.1, 0, 0.1, 2, 3, 3.00001, 123456}
for _, basic := range basics {
nb := numbitsFromFloat64(basic)
fmt.Printf("nb %d\n", nb)
}
}
The results:
nb 4631501856787818085
nb 9223372036854775808
nb 13815242216921733530
nb 13835058055282163712
nb 13837309855095848960
nb 13837309877613847097
nb 13906592281785270272
I guess the explanation is that the kinds of numbers that humans typically use are closely tied to our base-10 world, while numbits track the base-2 internal representation. So there's no obvious mapping between the way numbers look to us and the numbits representation.
But you're printing it in decimal. That won't help. Try %x instead of %d.
Oh of course, you are correct.
nb 4046666666666665
nb 8000000000000000
nb bfb999999999999a
nb c000000000000000
nb c008000000000000
nb c00800053e2d6239
nb c0fe240000000000
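As a rough way to quantify those trailing zeroes (a hypothetical helper, not something from the thread; it counts 8-bit bytes of the raw numbits value, which is what the hex dump above shows):

import "math/bits"

// trailingZeroBytes reports how many whole trailing bytes of a numbits value
// are zero, i.e. how many bytes a variable-length form could drop
func trailingZeroBytes(nb uint64) int {
	if nb == 0 {
		return 8
	}
	return bits.TrailingZeros64(nb) / 8
}

For the values above, for example, trailingZeroBytes(0xc000000000000000) is 7 and trailingZeroBytes(0xc00800053e2d6239) is 0.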
Now I will definitely have a close look at the variable-width numbits. First, do they preserve ordering? Because we will have queries like {"range": [ ">", 0, "<=", 3 ]} - it's possible, if a bit difficult, to build an automaton that matches this.
[On reflection, I guess they should. But will fuzz to be sure.]
They preserve ordering. They only rely on the end of a value comparing as smaller than or equal to a sequence of 0 bytes. Otherwise, you're good.
Resolved in #349