speedb icon indicating copy to clipboard operation
speedb copied to clipboard

Buckets / prefixes / tables: Building a triple store in speedb

Open lveillard opened this issue 1 year ago • 9 comments

Hello!

Lots of graphDBs use rocksDB as their core store, for instance TypeDB and the early versions of DGraph.

I was wondering if speedB could be a good substitute for rocksDB for those scenarios.

Most graph DBs using KVstores store their edges like this:

Key <-> Value

Edge: subject:predicate:object <->  null
Reverse edge: Key object:predicate:subject <->  null

Or like this

Edge: subject:predicate <-> Object
Reverse edge: object:predicate <-> Subject

About buckets / prefixes / tables

Some kv stores like boltDB or sled use prefixes,tables or buckets that have the potential to simplify this. I was wondering if something like this already exists in speedb?

Also, what would be the best strategy in speedb to build the foundations of an RDF database?

Example

Some examples of the triple values that i would like to store in speedB: (Subject predicate object)

**Schema**
- Person sub TYPE
- name sub ATTRIBUTE - name valuetype string #just putting two s-p-o in the same line for convenience
- familyName sub ATTRIBUTE - familyName valuetype string
- hobbies sub ATTRIBUTE - hobbies valuetype string[]
- Person owns name - Person owns familyName - Person owns hobbies


**Data**
- $p1 is Person

- $v1 is value - $v1 attributeType name  - $v1 value "Ann"
- $v2 is value - $v2 attributeType familyName - $v2 value "Doe"

- $v3-1 is value  - $v3-1 attributeType hobbies - $v3-1 value "Bike" 
- $v3-2 is value  - $v3-2 attributeType hobbies - $v3-2 value "Poker"
- $v3-3 is value  - $v3-3 attributeType hobbies - $v3-3 value "Yoga"

- $p1 has $v1 - $p1 has $v2 -  $p1 has $v3-1 - $p1 has $v3-2 - $p1 has $v3-3

Features that would help a lot:

  1. Buckets / Prefixes / Tables => So we could store the different triple types in different buckets (one bucket per predicate)
  2. A fast way to match all keys for a particular value (currently workarounded with indexes)
  3. A way to easily traverse the interconnected triples

Example of 3) "Get all instances of type Person, and their values" -> $p is Person. Person owns attributes name, familyName and hobbies, so we can check all the values that belong to those and are owned (throwgh predicate "has") by $p

lveillard avatar May 16 '23 14:05 lveillard

Hi! Thanks for submitting your request.

We plan to work on a secondary index project that seems to offer a solution for your use case. You can see more details here https://github.com/speedb-io/speedb/issues/484 Please review and share your thoughts. In case you find this interesting, we can work together to make sure it meets your needs.

bosmatt avatar May 22 '23 08:05 bosmatt

Hello and thanks for your answer,

Looks great! How would I use secondary indexes to performantly create / update and scan/get my triples (subject-predicate-object)? Would them be nestable?

The key things I like about boltDB's buckets is that

  1. They are intutive to use
  2. Using buckets as "predicates" for edges between ids showed a great performance boost over just storing keys as strings like "subject1:predicate7"<>"myValue" or "subjet1:predicate7:myvalue"<>null which is how typeDB and dgraph store their triples in RocksDB/Badger
  3. As buckets can be nestable, it enables a convinient way to store:
    1. List with repeated values
    2. Arrays with sortered values
    3. Tree-structured data

I have no idea about the black magic behind the performance improvement in (2), but i guess that storing predicates as strings and subjects & objects IDs as bytes for edges makes it faster than storing everything as strings.

For instance this seems to be way more performant:

Bucket ("authors")
   0x1 <> "Peter"
Bucket ("books")
   0x2 <> "A_book"
   0x3 <> "Another_book"
Bucket("authorship")
   0x1 <> [0x2,0x3]  
Bucket("authorship_reverse_index)
   0x2 <> [0x1]
   0x3 <> [0x1]

Than this

0x1:author:Peter <> null
0x2:book:A_book <> null
0x3:book:Another_Book <> null
0x1:authorship:0x2 <> null
0x1:authorship:0x3 <> null
0x2:authorship_reverse_index:0x1 <> null
0x3 authorshiop_reverse_index:0x1 <> null

It also looks way cleaner and safer to store IDs as bytes instead of strings

lveillard avatar May 26 '23 15:05 lveillard

Thanks @lveillard for opening this issue: There are two questions that you raise

  1. Using an id rather than a string. This decision is left for the user to decide. The "primary key" for each entry may be an id and the string (name) may be an attribute. Using an id is better from performance point of view on the other hand it lacks the "internal ordering" for example search for all the persons that starts with N .
  2. Mapping of secondary index (array vs. key per refereed object) . This is an internal design decision. We prefer using a key per refereed object. This design is more flexible and will work much better in case you have many references, as it uses the LSM power (e.g. hobby where there are some hobbies that are shared between millions of people) . We may provide (phase 2 if there is a requests) an iterator that will return multiple values for optimisation.

hilikspdb avatar Jun 07 '23 09:06 hilikspdb

Thanks for your answer!

Regarding 1, the fun thing of buckets is that you get the best of both words, you can use id as keys, and get the internal ordering, as everything inside a bucket is stored together

Bucket ("books")
   0x2 <> "A_book"
   0x3 <> "Another_book"

vs

"0x2:book" <> "A_book"
"0x3:book" <> "Another_Book"

Regarding 2, I'm not sure to grasp the relation with buckets. Is boltDB working with secondary indexes in order to enable buckets?

lveillard avatar Jun 07 '23 15:06 lveillard

The relationship between id and book require additional read (e.g. get all the books for a specific author need to get the id and than from the id the name).

In Speedb (and rocksdb) you can open a Column family that contains books or else use book as part of the key (if you put several types in the same column family) this flexibility is important for small tables.

the key of the secondary index contains the following data "secondary_key <> primary_key" . Again if this attribute is common e.g person name you can use a dedicated cf, if it is not you can add the attribute to the key. it may be important if you want to sort the data based on a different compactor for example.

Regardless when the data is saved the initial common part of the key is removed so : book-1 and book-2 will be kept as only one byte .

hilikspdb avatar Jun 07 '23 16:06 hilikspdb

Thank you again for your answer @hilikspdb.

Column Families seem like a good replacement for buckets. But I guess they can't be nested like buckets?

And I think I now understood what you meant with the secondary keys. I'm doing them manually, so https://github.com/speedb-io/speedb/issues/484 looks awesome.

On the other hand, I'm still a bit confused about what would be the best way to store this kind of schema and data in Speedb? Could we use Column Families here to group predicates (sub, valueType, owns, is...)?

**Schema**
- Person sub TYPE
- name sub ATTRIBUTE 
- name valuetype string 
- familyName sub ATTRIBUTE
- familyName valuetype string
- hobbies sub ATTRIBUTE
- hobbies valuetype string[]
- Person owns name
- Person owns familyName
- Person owns hobbies


**Data**
- $p1 is Person

- $v1 is value
- $v1 attributeType name 
- $v1 value "Ann"

- $v2 is value
- $v2 attributeType familyName
- $v2 value "Doe"

- $v3-1 is value 
- $v3-1 attributeType hobbies
- $v3-1 value "Bike" 
- $v3-2 is value 
- $v3-2 attributeType hobbies
- $v3-2 value "Poker"
- $v3-3 is value
- $v3-3 attributeType hobbies
- $v3-3 value "Yoga"

- $p1 has $v1
- $p1 has $v2
- $p1 has $v3-1
- $p1 has $v3-2
- $p1 has $v3-3

lveillard avatar Jun 07 '23 23:06 lveillard

In general the user is free to choose any schema that she want as long as you define a unique index per cf. But this is not a question for this PR and I rather answer it in a different place

hilikspdb avatar Jun 08 '23 04:06 hilikspdb

@lveillard - Let's move this interesting discussion to the Speedb Discord - https://discord.com/invite/deMznf8aZZ Please open a thread on the forum - You can tag "@hilik" to follow up. Thanks

Guyme avatar Jun 08 '23 07:06 Guyme

related to #484

Guyme avatar Jan 02 '24 10:01 Guyme