
VAET index

Open darkleaf opened this issue 5 years ago • 10 comments

Why wasn't the VAET index implemented? Would it be useful to implement it?

darkleaf avatar Jun 02 '20 09:06 darkleaf

For all use-cases, AVET covers what VAET does. I’m not sure what its purpose in Datomic is. If you have a use-case, I might consider adding it. Otherwise, it’ll just be a performance penalty

tonsky avatar Jun 02 '20 12:06 tonsky

VAET is only necessary when you don't know which reverse reference-type attributes might be applicable for a given entity. For a small DataScript-sized system this is probably never an issue. At a larger scale, with potentially millions of attributes, you need VAET in order to do explorative navigation of the graph efficiently (when you can't be sure which reverse attributes might be relevant beforehand). Without VAET you have to scan through all the possible combinations in AVET.

refset avatar Jun 28 '20 16:06 refset

This only arises when deleting entities, right?

tonsky avatar Jun 28 '20 20:06 tonsky

I'm saying that if you have many different :db.type/ref attributes then you can't query [?e _ e123] without naively scanning through all combinations in AVET.
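To make the query shape concrete, here is a minimal sketch in DataScript (the schema, entities, and attribute names are hypothetical, purely for illustration):

```clojure
(require '[datascript.core :as d])

;; Hypothetical schema: several ref-type attributes, any of which
;; might point at a given entity.
(def conn (d/create-conn {:person/friend {:db/valueType :db.type/ref}
                          :doc/author    {:db/valueType :db.type/ref}}))

(d/transact! conn [{:db/id -1 :person/name "Ann"}
                   {:db/id -2 :person/friend -1}
                   {:db/id -3 :doc/author -1}])

;; Reverse lookup with an unknown attribute -- the [?e _ v] shape.
;; With a VAET index this would be a single range scan on v;
;; without it, the engine has to consider every ref attribute:
(d/q '[:find ?e ?a
       :in $ ?v
       :where [?e ?a ?v]]
     @conn 1)
```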

I don't know exactly how VAET would be wired up internally, but I would expect to see something else more advanced than the EAVT filter here: https://github.com/tonsky/datascript/blob/6e80073355ef35bf0b0d94afd3dd5d0f97ded2c1/src/datascript/db.cljc#L504

Again, I doubt most users would be storing enough attributes or overall data in DataScript to warrant changing anything. Most of the time people know exactly which attributes they need in their queries.

For the curious, this is where near enough the same kind of scanning occurs for deleting entities: https://github.com/tonsky/datascript/blob/6e80073355ef35bf0b0d94afd3dd5d0f97ded2c1/src/datascript/db.cljc#L1278

refset avatar Jun 28 '20 23:06 refset

In the Datalog-derived system I built, VAET was used to determine which records had a particular attribute-value pair, where the cardinality of the attribute was low; I found it to be essential in deletion.

But this was for a multi-terabyte database with billions of tuples.


wbrown avatar Jun 29 '20 00:06 wbrown

[?e _ e123]

That would require VAET, of course, but I can’t imagine a use-case where anybody would need that type of query. The only use-case I can come up with is deletion, and it’s a tradeoff: faster deletions or faster insertions (one less index to fill).

tonsky avatar Jun 29 '20 13:06 tonsky

Yeah, it's certainly a tradeoff. I suspect that without VAET the deletion time cost might be too unpredictable at a large-enough scale to maintain consistent throughput, whereas the space (+ time) costs of inserting into VAET are amortised which makes it a much safer default for a highly-distributed production system like Datomic.

I think the non-deletion use-cases can be described as "exploratory" queries, where your data set is large (many attributes) and not well understood (can't predict which attributes might be relevant). I imagine Roam's knowledge graph "backlinks" feature might be a good use-case for benefiting from a native VAET (backlink!) index, if such a graph ever grew large enough (i.e. multi-user).

Incidentally, I just came across these discussions that mention Datomic's ability to do wildcard reverse lookups with the Entity API via (keys (.cache (.touch ent))):

  • https://stackoverflow.com/questions/15629085/in-datomic-how-is-it-possible-to-find-out-what-keys-are-available-for-reverse-l
  • https://stackoverflow.com/questions/14189647/get-all-fields-from-a-datomic-entity
  • (admittedly I can't see this behaviour officially documented anywhere, and I've not confirmed that it works with my own eyes)

Providing a similar index to VAET is something that the Crux team has been contemplating recently, motivated by our work on a generic entity navigation UI. Crux avoids the deletion issue by choosing not to enforce referential integrity, i.e. there are no explicit reference-type attributes, only values and IDs that might happen to correspond.

I'll leave it at that for my contributions to this thread but hopefully it's useful context for some. I remember thinking about this very question several years ago, so it's good to have finally written it down :)

refset avatar Jun 29 '20 17:06 refset

Here is a use case that produces a potentially large number of attributes and in which a query of the form [?es _ e123] is relevant:

One way to import large vectors (for example, an ordered sequence of events) is to use “numbered” attributes: for example, [event-source-id :345 event-id], where event-source is the entity representing the source of the ordered event sequence and event is the event at index 345 in the vector.

In this case, [?event-source _ specific-event-id] is a legitimate query clause to find the source metadata relating to a specific event.

This use case potentially involves a large number of attributes. And this data design is one of the few with performance characteristics similar to a vector in Clojure or an array in other languages.

This use case was important enough to be integrated in the RDF schema standard: https://www.w3.org/TR/rdf-schema/#ch_containermembershipproperty
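For illustration, the numbered-attribute design might be sketched in DataScript like this (the entity ids, attribute names, and event id are made up for the example):

```clojure
(require '[datascript.core :as d])

;; Hypothetical "numbered attribute" encoding: index 345 in the event
;; vector becomes the keyword attribute :345.
(def conn (d/create-conn))

(d/transact! conn [[:db/add 1 :source/name "sensor-A"]
                   [:db/add 1 :345 "event-345-id"]])

;; Finding the source of a specific event requires the wildcard
;; attribute position -- exactly the [?e _ v] shape that wants VAET:
(d/q '[:find ?event-source
       :in $ ?event-id
       :where [?event-source _ ?event-id]]
     @conn "event-345-id")
```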

bluesun avatar Jul 07 '20 09:07 bluesun


Hi @tonsky, I am curious about your feedback on this point. I can easily extend the argumentation if you are interested...

bluesun avatar Aug 11 '20 12:08 bluesun

Thank you @bluesun, your argument is clear. I do not like the approach: it feels like a misuse of the attribute field, it produces an unbounded number of keywords, and it requires string concatenation/parsing if you want two ordered attributes. And it works poorly with the existing index structure :)

tonsky avatar Aug 11 '20 15:08 tonsky