gohll icon indicating copy to clipboard operation
gohll copied to clipboard

Excessive errors when inserting ~1M values

Open kokes opened this issue 6 years ago • 4 comments

Maybe I'm doing something horribly wrong, but I'm getting cardinality errors way beyond the specified boundary. All I do is that I run this simple snippet and it starts going off track at about 700k

package main

import (
	"fmt"
	"math/rand"

	"github.com/mynameisfiber/gohll"
)

func main() {
	sizes := []int{1000, 10000, 100000, 500000, 750000, 1000000, 5000000, 10000000, 25000000}
	for _, sz := range sizes {
		h, err := gohll.NewHLLByError(0.001)
		if err != nil {
			panic(err)
		}

		for i := 0; i < sz; i++ {
			h.Add(fmt.Sprintf("%d", rand.Uint32()))
		}
		c := h.Cardinality()
		fmt.Printf("expected: %v, got: %.2f, err: %.2f%%\n", sz, c, 100*(1-float64(sz)/c))
	}
}

This is what I got:

$ go run first.go 
expected: 1000, got: 1000.01, err: 0.00%
expected: 10000, got: 9999.49, err: -0.01%
expected: 100000, got: 99989.83, err: -0.01%
expected: 500000, got: 499939.97, err: -0.01%
expected: 750000, got: 1904282.61, err: 60.62%
expected: 1000000, got: 2048181.58, err: 51.18%
expected: 5000000, got: 5142333.09, err: 2.77%
expected: 10000000, got: 9987380.00, err: -0.13%
expected: 25000000, got: 24947424.06, err: -0.21%

When I bisected this around the 700k mark, I found out it breaks at the 640k mark.

$ go run first.go 
expected: 600000, got: 599886.58, err: -0.02%
expected: 610000, got: 610057.31, err: 0.01%
expected: 620000, got: 620025.35, err: 0.00%
expected: 630000, got: 629970.89, err: -0.00%
expected: 640000, got: 1842767.34, err: 65.27%
expected: 650000, got: 1848513.38, err: 64.84%
expected: 660000, got: 1853794.67, err: 64.40%
expected: 670000, got: 1859316.35, err: 63.97%
expected: 680000, got: 1865042.87, err: 63.54%
expected: 690000, got: 1871086.33, err: 63.12%
expected: 700000, got: 1876002.44, err: 62.69%
expected: 710000, got: 1881378.30, err: 62.26%
expected: 720000, got: 1887115.06, err: 61.85%
expected: 730000, got: 1892798.72, err: 61.43%
expected: 740000, got: 1898367.82, err: 61.02%
expected: 750000, got: 1904090.15, err: 60.61%

I'm not terribly familiar with the design of the library to comment any further at this point. But let me know if you need a bit more investigation into this.

kokes avatar Sep 28 '18 23:09 kokes

Hrmm... this is really interesting! I think it has to do with the transition from sparse mode to normal mode. That being said, I haven't played with this library in quite a while so I've lost a sense of at which cardinalities this switch happens for various error rates.

If you could dig a bit deeper, that would be amazing! If not, I'll try to get to it in the next couple weeks.

mynameisfiber avatar Oct 09 '18 15:10 mynameisfiber

So a couple things I've found:

  • The problem only occurs when the HLL is build with P >= 19
  • It happens when the HLL is converted from SPARSE to NORMAL mode
  • It happens with both MMH3 and FNV1a hashes
  • The encode/decode system of the sparse list seems to work even with P >= 19
  • I modified the toNormal routine to first merge in the tempList then do the whole register procedure and this doesn't fix the problem

mynameisfiber avatar Oct 19 '18 15:10 mynameisfiber

~~Actually... I take some of that back. I think the problem is when the HLL precision and the SparseList precision are too close to eachother. In this code we have the sparse precision hard coded to 25. This could cause a problem in the first branch of auxillary.decodeHash.~~

~~I'm going to play around with having a dynamically picked sparse precision... but that may have to wait a bit!~~

mynameisfiber avatar Oct 19 '18 15:10 mynameisfiber

I suspect it's really the "normal" calculation, which is only correct once the number of elements is beyond ~1M, the above example is correct below 640K, because it's calculated using the sparse method. If you force the normal method from the start, the numbers are way off from the get go.

(...)
h, err := gohll.NewHLLByError(0.001)
h.ToNormal()
(...)

Results in

$ go run err.go 
expected: 1000, got: 1513154.88, err: 99.93%
expected: 10000, got: 1517511.97, err: 99.34%
expected: 100000, got: 1561267.99, err: 93.59%
expected: 500000, got: 1766954.08, err: 71.70%
expected: 750000, got: 1904282.61, err: 60.62%
expected: 1000000, got: 2048181.58, err: 51.18%
expected: 5000000, got: 5142333.09, err: 2.77%
expected: 10000000, got: 9987380.00, err: -0.13%
expected: 25000000, got: 24947424.06, err: -0.21%```

kokes avatar Oct 21 '18 11:10 kokes