Representing Hierarchies in the Catalog DB
We had a very interesting conversation with @danielballan yesterday about the structure of the catalog.db database, specifically the representation of the hierarchy of nodes.
The rows of the nodes table store metadata of assets that often (always?) are in a hierarchical relationship with each other (e.g. a tree of folders). Currently, this structure is encoded by storing lists of ancestors of each node. Since several nodes may have same uids (e.g. names of subfolders in separate paths of the tree), the lists of ancestors serve as a second part of the composite key to ensure uniqueness (please see attached). While having lists of ancestors readily available may improve the query performance, it also raises the question of the efficiency of data storage and the complexity of the key.
A possible solution could be:
- Ensure that
uidcolumn uniquely identifies each node in the table. - Use these uids to capture the tree structure in an external linked "closure" table. The idea is to keep track of every "ancestor-descendant" pair (not just "parent-child") along with the corresponding distance (depth) between them. Assuming that each row is it's own descendant and ancestor at depth 0, this leads to fast rules for data queries and updates without the need to keep lists of references.
For small (shallow) hierarchies the befits of using closure tables could be minimal, but it should be quite noticeable with larger datasets and would help us to future-proof the database schema. Please share your thoughts and comments! @danielballan, @dylanmcreynolds, @Kezzsim
The closure table looks like very elegant pattern that applies well to our use case. I like that triggers can be used to keep the closure table in sync with the "main" table (in our case, the nodes table) without the application getting directly involved.
I think we can use the id as a unique ID for each node.
https://github.com/bluesky/tiled/blob/63b583be42fbe69349cee9b3be01a7a056e5eb04/tiled/catalog/orm.py#L62-L63
The key column, which is sometimes a BlueskyRun <uid> but is in general is any string, is akin to name in the example in that article. The key / name can repeat; we just need a unique constraint on (key, parent). (Currently we have a unique constraint on (key, ancestors).)
I think makes sense to sketch this at a level we can benchmark it against realistic data at a realistic scale.