succinct icon indicating copy to clipboard operation
succinct copied to clipboard

succinct static kv

succinct

Travis test

Report card Coverage Status

GoDoc PkgGoDev Sourcegraph

succinct provides several static succinct data types

  • Succinct Set
    • Synopsis
    • Performance
    • Implementation
  • License

Succinct Set

中文介绍: 100行代码的压缩前缀树: 50% smaller

Set is a succinct, sorted and static string set impl with compacted trie as storage. The space cost is about half lower than the original data.

Synopsis

package succinct

import "fmt"

func ExampleNewSet() {
	keys := []string{
		"A", "Aani", "Aaron", "Aaronic", "Aaronical", "Aaronite",
		"Aaronitic", "Aaru", "Ab", "Ababdeh", "Ababua", "Abadite",
	}
	s := NewSet(keys)
	for _, k := range []string{"Aani", "Foo", "Ababdeh"} {
		found := s.Has(k)
		fmt.Printf("lookup %10s, found: %v\n", k, found)
	}

	// Output:
	//
	// lookup       Aani, found: true
	// lookup        Foo, found: false
	// lookup    Ababdeh, found: true
}

Performance

  • 200 kilo real-world words collected from web:

    • the space costs is 57% of original data size.
    • And a Has() costs about 350 ns with a zip-f workload.

    Original size: 2204 KB

    With comparison with string array bsearch and google btree :

    Data Engine Size(KB) Size/original ns/op
    200kweb2 bsearch 5890 267% 229
    200kweb2 succinct.Set 1258 57% 356
    200kweb2 btree 12191 553% 483

    A string in go has two fields: a pointer to the text content and a length. Thus the space overhead is quite high with small strings. btree internally has more pointers and indirections(interface).

  • 870 kilo real-world ipv4:

    • the space costs is 67% of original data size.
    • And a Has() costs about 500 ns with a zip-f workload.

    Original size: 6823 KB

    Data Engine Size(KB) Size/original ns/op
    870k_ip4_hex bsearch 17057 500% 276
    870k_ip4_hex succinct.Set 2316 67% 496
    870k_ip4_hex btree 40388 1183% 577

Implementation

It stores sorted strings in a compacted trie(AKA prefix tree). A trie node has at most 256 outgoing labels. A label is just a single byte. E.g., [ab, abc, abcd, axy, buv] is represented with a trie like the following: (Numbers are node id)

^ -a-> 1 -b-> 3 $
  |      |      `c-> 6 $
  |      |             `d-> 9 $
  |      `x-> 4 -y-> 7 $
  `b-> 2 -u-> 5 -v-> 8 $

Internally it uses a packed []byte and a bitmap with len([]byte) bits to describe the outgoing labels of a node,:

^: ab  00
1: bx  00
2: u   0
3: c   0
4: y   0
5: v   0
6: d   0
7: ø
8: ø
9: ø

In storage it packs labels together and bitmaps joined with separator 1:

labels(ignore space): "ab bx u c y v d"
label bitmap:          0010010101010101111

In this way every node has a 0 pointing to it(except the root node) and has a corresponding 1 for it:

                               .-----.
                       .--.    | .---|-.
                       |.-|--. | | .-|-|.
                       || ↓  ↓ | | | ↓ ↓↓
labels(ignore space):  ab bx u c y v d øøø
label bitmap:          0010010101010101111
node-id:               0  1  2 3 4 5 6 789
                          || | ↑ ↑ ↑ |   ↑
                          || `-|-|-' `---'
                          |`---|-'
                          `----'

To walk from a parent node along a label to a child node, count the number of 0 upto the bit the label position, then find where the the corresponding 1 is:

childNodeId = select1(rank0(i))

In our impl, it is:

nodeId = countZeros(ss.labelBitmap, ss.ranks, bmIdx+1)
bmIdx = selectIthOne(ss.labelBitmap, ss.ranks, ss.selects, nodeId-1) + 1

Finally leaf nodes are indicated by another bitmap leaves, in which a 1 at i-th bit indicates the i-th node is a leaf:

leaves: 0001001111

License

This project is licensed under the MIT License - see the LICENSE file for details.