Buckets / prefixes / tables: Building a triple store in speedb
Hello!
Lots of graph DBs use RocksDB as their core store, for instance TypeDB and the early versions of Dgraph.
I was wondering if Speedb could be a good substitute for RocksDB in those scenarios.
Most graph DBs built on KV stores store their edges like this:
Key <-> Value
Edge: subject:predicate:object <-> null
Reverse edge: object:predicate:subject <-> null
Or like this
Edge: subject:predicate <-> Object
Reverse edge: object:predicate <-> Subject
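To make the layouts above concrete, here is a minimal sketch of the first scheme, using an in-memory sorted list to stand in for the KV store's ordered key space (illustrative only, not Speedb/RocksDB API; the `spo:`/`ops:` prefixes are my own assumption):

```python
# Triples flattened into "direction:first:predicate:last" strings kept in a
# sorted list, so a prefix scan answers both forward and reverse lookups.
from bisect import bisect_left, insort

store = []  # sorted list of keys, standing in for the KV store's key order

def put_triple(s, p, o):
    insort(store, f"spo:{s}:{p}:{o}")  # forward edge
    insort(store, f"ops:{o}:{p}:{s}")  # reverse edge

def scan_prefix(prefix):
    # collect every key under the given prefix, in key order
    i = bisect_left(store, prefix)
    out = []
    while i < len(store) and store[i].startswith(prefix):
        out.append(store[i])
        i += 1
    return out

put_triple("0x1", "likes", "0x2")
put_triple("0x1", "likes", "0x3")
put_triple("0x4", "likes", "0x2")

forward = scan_prefix("spo:0x1:likes:")  # objects that 0x1 likes
reverse = scan_prefix("ops:0x2:likes:")  # subjects that like 0x2
```

Because keys are stored sorted, both directions of a query become a single range scan over a prefix.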
About buckets / prefixes / tables
Some KV stores like boltDB or sled use prefixes, tables or buckets, which have the potential to simplify this. I was wondering if something like this already exists in Speedb?
Also, what would be the best strategy in Speedb to build the foundations of an RDF database?
Example
Some examples of the triple values that I would like to store in Speedb (subject predicate object):
**Schema**
- Person sub TYPE
- name sub ATTRIBUTE - name valuetype string #just putting two s-p-o in the same line for convenience
- familyName sub ATTRIBUTE - familyName valuetype string
- hobbies sub ATTRIBUTE - hobbies valuetype string[]
- Person owns name - Person owns familyName - Person owns hobbies
**Data**
- $p1 is Person
- $v1 is value - $v1 attributeType name - $v1 value "Ann"
- $v2 is value - $v2 attributeType familyName - $v2 value "Doe"
- $v3-1 is value - $v3-1 attributeType hobbies - $v3-1 value "Bike"
- $v3-2 is value - $v3-2 attributeType hobbies - $v3-2 value "Poker"
- $v3-3 is value - $v3-3 attributeType hobbies - $v3-3 value "Yoga"
- $p1 has $v1 - $p1 has $v2 - $p1 has $v3-1 - $p1 has $v3-2 - $p1 has $v3-3
Features that would help a lot:
- Buckets / Prefixes / Tables => So we could store the different triple types in different buckets (one bucket per predicate)
- A fast way to match all keys for a particular value (currently worked around with manual indexes)
- A way to easily traverse the interconnected triples
Example of the third feature: "Get all instances of type Person, and their values" -> $p is Person. Person owns the attributes name, familyName and hobbies, so we can check all the values that belong to those attribute types and are owned (through the predicate "has") by $p.
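A sketch of how that traversal could look over a plain prefix-encoded key space (illustrative only; the key layout and helper names are my assumptions, not Speedb API):

```python
from bisect import bisect_left, insort

store = []  # sorted keys standing in for the KV store

def put(key):
    insort(store, key)

def scan(prefix):
    # yield the key suffixes under the given prefix
    i = bisect_left(store, prefix)
    while i < len(store) and store[i].startswith(prefix):
        yield store[i][len(prefix):]
        i += 1

# a slice of the data above, flattened into keys
put("is:$p1:Person")
put("has:$p1:$v1")
put("has:$p1:$v3-1")
put("value:$v1:Ann")
put("value:$v3-1:Bike")

# "Get all instances of type Person, and their values"
people = [k.split(":")[0] for k in scan("is:") if k.endswith(":Person")]
values = []
for p in people:
    for v in scan(f"has:{p}:"):       # hop 1: person -> value nodes
        values.extend(scan(f"value:{v}:"))  # hop 2: value node -> literal
```

Each hop of the traversal is another prefix scan, which is exactly the access pattern where ordered keys (or buckets) pay off.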
Hi! Thanks for submitting your request.
We plan to work on a secondary index project that seems to offer a solution for your use case. You can see more details here https://github.com/speedb-io/speedb/issues/484 Please review and share your thoughts. In case you find this interesting, we can work together to make sure it meets your needs.
Hello and thanks for your answer,
Looks great! How would I use secondary indexes to performantly create/update and scan/get my triples (subject-predicate-object)? Would they be nestable?
The key things I like about boltDB's buckets are that:
- They are intuitive to use
- Using buckets as "predicates" for edges between ids showed a great performance boost over just storing keys as strings like
"subject1:predicate7" <> "myValue"
or "subject1:predicate7:myValue" <> null
which is how TypeDB and Dgraph store their triples in RocksDB/Badger
- As buckets can be nested, they enable a convenient way to store:
  - Lists with repeated values
  - Arrays with sorted values
  - Tree-structured data
I have no idea about the black magic behind the performance improvement in the second point, but I guess that storing predicates as strings and subject & object IDs as bytes makes edges faster than storing everything as strings.
For instance this seems to be way more performant:
Bucket ("authors")
0x1 <> "Peter"
Bucket ("books")
0x2 <> "A_book"
0x3 <> "Another_book"
Bucket("authorship")
0x1 <> [0x2,0x3]
Bucket("authorship_reverse_index")
0x2 <> [0x1]
0x3 <> [0x1]
than this:
0x1:author:Peter <> null
0x2:book:A_book <> null
0x3:book:Another_Book <> null
0x1:authorship:0x2 <> null
0x1:authorship:0x3 <> null
0x2:authorship_reverse_index:0x1 <> null
0x3:authorship_reverse_index:0x1 <> null
It also looks way cleaner and safer to store IDs as bytes instead of strings
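For illustration, a byte-id key encoding along those lines could look like this (an assumed encoding of my own, not an actual API; note a variable-length predicate between fixed-width ids would need a separator or length prefix in a real design):

```python
import struct

def edge_key(subject_id: int, predicate: str, object_id: int) -> bytes:
    # fixed-width big-endian ids sort numerically and keep keys compact;
    # the predicate stays a human-readable string between them
    return (struct.pack(">Q", subject_id)
            + predicate.encode()
            + struct.pack(">Q", object_id))

key = edge_key(0x1, "authorship", 0x2)
```

Big-endian packing matters here: it makes the byte order of the key match the numeric order of the ids, so range scans over ids still work.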
Thanks @lveillard for opening this issue: There are two questions that you raise
- Using an id rather than a string. This decision is left to the user. The "primary key" for each entry may be an id, and the string (name) may be an attribute. Using an id is better from a performance point of view; on the other hand, it lacks the "internal ordering", for example searching for all the persons whose name starts with "N".
- Mapping of the secondary index (an array vs. one key per referred object). This is an internal design decision. We prefer one key per referred object. This design is more flexible and works much better when you have many references, as it uses the power of the LSM (e.g. a hobby that is shared between millions of people). We may provide (in phase 2, if there are requests) an iterator that returns multiple values as an optimisation.
Thanks for your answer!
Regarding 1, the fun thing about buckets is that you get the best of both worlds: you can use ids as keys and still get the internal ordering, as everything inside a bucket is stored together:
Bucket ("books")
0x2 <> "A_book"
0x3 <> "Another_book"
vs
"0x2:book" <> "A_book"
"0x3:book" <> "Another_Book"
Regarding 2, I'm not sure I grasp the relation with buckets. Does boltDB use secondary indexes in order to enable buckets?
The relationship between an id and a book requires an additional read (e.g. getting all the books for a specific author means getting the ids first and then, from each id, the name).
In Speedb (and RocksDB) you can open a column family that contains books, or else use "book" as part of the key (if you put several types in the same column family); this flexibility is important for small tables.
The key of the secondary index contains the following data: "secondary_key <> primary_key". Again, if the attribute is common (e.g. person name) you can use a dedicated CF; if it is not, you can add the attribute to the key. This may be important if you want to sort the data with a different comparator, for example.
Regardless, when the data is saved the initial common part of the key is removed, so the shared prefix of book-1 and book-2 is stored only once and each key effectively costs a single distinct byte.
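A minimal sketch of that "one key per referred object" layout, using an in-memory sorted list (illustrative only, not Speedb internals):

```python
from bisect import bisect_left, insort

index = []  # sorted "secondary_key:primary_key" entries, one per reference

def add_ref(secondary, primary):
    # each reference is its own key, so millions of sharers of one
    # secondary value are just many small keys in the LSM
    insort(index, f"{secondary}:{primary}")

def lookup(secondary):
    # prefix scan returns every primary key for the secondary value
    prefix = f"{secondary}:"
    i = bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i].startswith(prefix):
        out.append(index[i][len(prefix):])
        i += 1
    return out

add_ref("yoga", "0x1")
add_ref("yoga", "0x7")
add_ref("bike", "0x1")
```

Compared with storing an array of ids under one key, this avoids read-modify-write on every update: adding a reference is a single put.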
Thank you again for your answer @hilikspdb.
Column Families seem like a good replacement for buckets. But I guess they can't be nested like buckets?
And I think I now understood what you meant with the secondary keys. I'm doing them manually, so https://github.com/speedb-io/speedb/issues/484 looks awesome.
On the other hand, I'm still a bit confused about the best way to store this kind of schema and data in Speedb. Could we use Column Families here to group predicates (sub, valuetype, owns, is...)?
**Schema**
- Person sub TYPE
- name sub ATTRIBUTE
- name valuetype string
- familyName sub ATTRIBUTE
- familyName valuetype string
- hobbies sub ATTRIBUTE
- hobbies valuetype string[]
- Person owns name
- Person owns familyName
- Person owns hobbies
**Data**
- $p1 is Person
- $v1 is value
- $v1 attributeType name
- $v1 value "Ann"
- $v2 is value
- $v2 attributeType familyName
- $v2 value "Doe"
- $v3-1 is value
- $v3-1 attributeType hobbies
- $v3-1 value "Bike"
- $v3-2 is value
- $v3-2 attributeType hobbies
- $v3-2 value "Poker"
- $v3-3 is value
- $v3-3 attributeType hobbies
- $v3-3 value "Yoga"
- $p1 has $v1
- $p1 has $v2
- $p1 has $v3-1
- $p1 has $v3-2
- $p1 has $v3-3
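To make the question concrete, here is a toy model of what I mean by grouping predicates, where each plain dict stands in for one column family (names and layout are just my assumption, not a proposed Speedb API):

```python
# toy model: one dict per predicate, each standing in for a column family
cfs = {p: {} for p in ("sub", "valuetype", "owns", "is",
                       "attributeType", "value", "has")}

def put(predicate, subject, obj):
    # within a "CF", the subject is the key; repeated puts append objects
    cfs[predicate].setdefault(subject, []).append(obj)

# a slice of the data above
put("is", "$p1", "Person")
put("has", "$p1", "$v1")
put("attributeType", "$v1", "name")
put("value", "$v1", "Ann")
```

The predicate then never appears in any key, and a query like "all has-edges of $p1" is a single lookup in one CF.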
In general, users are free to choose any schema they want, as long as they define a unique index per CF. But this is not a question for this thread, and I'd rather answer it in a different place.
@lveillard - Let's move this interesting discussion to the Speedb Discord - https://discord.com/invite/deMznf8aZZ Please open a thread on the forum - You can tag "@hilik" to follow up. Thanks
related to #484