Content-store indexation integration
At the end of sprint 1, once the content-store indexation work is completed, it should be integrated into the data-pipeline flow.
Summary
This task assumes that #1523 is completed and available in the integration branch.
Now that the content-store library has first-class support for indexation (index trees distributed in the content-store), we need to integrate it into the various parts of our pipeline.
Namely, but perhaps not exhaustively, this means we need to work on:
Source-control integration
Replace the current file-system-based implementation of the lgn-source-control crate so that it uses those indexes. We can start simple, by only adding resources to one index, but we should plan for the future and make supporting multiple indexes easy.
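As a purely illustrative sketch (not the content-store or lgn-source-control API), "plan for multiple indexes" could be as simple as enumerating the index kinds the commit path iterates over, wiring up only the first one initially:

```rust
/// Hypothetical enumeration of the index trees source-control could maintain per
/// commit. Only `ByOid` would be wired up at first; the others are placeholders so
/// that adding them later means extending this list rather than reworking the
/// commit path.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum IndexKind {
    ByOid,
    ByReverseDependency,
    ByVirtualPath,
    ByCoordinates,
}

/// Hypothetical helper: the set of indexes a commit should update. Starting with a
/// single entry keeps the first iteration simple while leaving room for more.
pub fn active_indexes() -> Vec<IndexKind> {
    vec![IndexKind::ByOid]
}
```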
Right now, the source-control assumes that all of its content consists of "files" that each have an on-disk representation in local workspaces. Basically, a workspace is the materialization of a given commit on disk and provides a familiar interface for people. While this is great in terms of familiarity, it hinders our cloud-nativeness by forcing us to download everything to disk and upload it back from disk all the time.
We need to change many things in the source-control crate:
- We need to remove the concept of a disk-based workspace (or perhaps even remove the concept of a workspace altogether?). After all, if everything is cloud-based, the only thing people need in order to commit changes is the root of an index tree. Committing becomes the act of modifying things in that tree and storing back the new tree index identifier (which is guaranteed to differ if there are any changes). See the sketch after this list.
- We used the singular in the previous point by talking about "an index tree", but ideally we should support adding/updating/removing resources in more than one index tree at once. Each index would be used in a different context: storing resources by OID, by reverse dependency, by virtual path (like a filesystem path), by coordinates, and so on.
- Resources need to implement some sort of interface that can return the index keys under which to store them in the various trees. Not all resources need to be stored in all trees: for instance, if a resource does not have coordinates, it makes no sense to add it to an index of resources by coordinates. A given resource could also appear several times in an index: in an index of resources by category, a resource that belongs to several categories would appear once per category. As such, this interface should likely look similar to:

  ```rust
  fn get_index_key_for_index_name(index_name: &str) -> Vec<CompositeIndexKey>
  ```

  where returning an empty `Vec` means: "do not index this resource". A hedged sketch of such an interface follows this list.
- Likewise, the current source-control CLI may need to be heavily adapted (it stays relevant for commands like `create-repository` and such, but checking out files on disk may not make sense anymore, especially as many resources may not even have a "filesystem path" anymore).
- The source-control filesystem CLI may also simply be disabled for now, as it probably doesn't make much sense in that context either.
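To make the above more concrete, here is a minimal, self-contained sketch of the keying interface and of what "committing into multiple index trees" could look like. Everything here (`CompositeIndexKey`, `IndexTree`, `Indexable`, `Texture`, `commit_resource`, the index names) is a hypothetical placeholder with in-memory stand-ins, not the actual lgn-content-store or lgn-source-control API; the trait method also takes `&self`, unlike the bare signature above.

```rust
use std::collections::BTreeMap;

/// Hypothetical composite key: an ordered list of key parts (e.g. category, then id).
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct CompositeIndexKey(Vec<String>);

/// Hypothetical stand-in for a content-store index tree: just an in-memory map from
/// composite key to resource identifier, so the sketch runs on its own.
#[derive(Default, Debug)]
struct IndexTree(BTreeMap<CompositeIndexKey, String>);

/// The interface sketched above: a resource returns the keys under which it should
/// appear in a given index, and an empty `Vec` means "do not index this resource".
trait Indexable {
    fn resource_id(&self) -> String;
    fn get_index_key_for_index_name(&self, index_name: &str) -> Vec<CompositeIndexKey>;
}

/// Hypothetical resource type, used purely for illustration.
struct Texture {
    id: String,
    virtual_path: String,
    categories: Vec<String>,
    coordinates: Option<(i64, i64)>,
}

impl Indexable for Texture {
    fn resource_id(&self) -> String {
        self.id.clone()
    }

    fn get_index_key_for_index_name(&self, index_name: &str) -> Vec<CompositeIndexKey> {
        match index_name {
            // One key made of the path components.
            "by-path" => vec![CompositeIndexKey(
                self.virtual_path.split('/').map(str::to_owned).collect(),
            )],
            // A resource that belongs to several categories appears once per category.
            "by-category" => self
                .categories
                .iter()
                .map(|c| CompositeIndexKey(vec![c.clone(), self.id.clone()]))
                .collect(),
            // No coordinates: returning an empty `Vec` skips this index.
            "by-coordinates" => self
                .coordinates
                .map(|(x, y)| vec![CompositeIndexKey(vec![x.to_string(), y.to_string()])])
                .unwrap_or_default(),
            _ => vec![],
        }
    }
}

/// "Committing" in this sketch is just updating every index tree in place; in the
/// real design this would instead produce new tree root identifiers in the
/// content-store.
fn commit_resource(resource: &dyn Indexable, trees: &mut BTreeMap<String, IndexTree>) {
    for (index_name, tree) in trees.iter_mut() {
        for key in resource.get_index_key_for_index_name(index_name) {
            tree.0.insert(key, resource.resource_id());
        }
    }
}

fn main() {
    let mut trees: BTreeMap<String, IndexTree> = BTreeMap::new();
    trees.insert("by-path".to_owned(), IndexTree::default());
    trees.insert("by-category".to_owned(), IndexTree::default());
    trees.insert("by-coordinates".to_owned(), IndexTree::default());

    let texture = Texture {
        id: "res:1234".to_owned(),
        virtual_path: "world/props/crate.tex".to_owned(),
        categories: vec!["props".to_owned(), "textures".to_owned()],
        coordinates: None, // no coordinates: skipped by the coordinates index.
    };

    commit_resource(&texture, &mut trees);
    println!("{:#?}", trees);
}
```

The in-memory `BTreeMap` only stands in for a content-store index tree so the sketch compiles on its own; the real implementation would read and write index trees in the content-store and hand back the new root identifiers.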
Data-pipeline integration
Following the changes to the source-control, one will also need to update all invocations of the source-control workspaces to use the refreshed concepts. Basically, make it so everything compiles and works after the other changes are submitted :D
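As a tiny illustration of what that shift in callers could look like (again reusing the hypothetical placeholders from the sketch above, not a real data-pipeline API), a pipeline step that used to read a file from a disk workspace would instead resolve the resource through an index:

```rust
/// Hypothetical lookup: instead of materializing a workspace on disk and reading a
/// file, a pipeline step resolves a resource identifier through the "by-path" index
/// tree. Reuses the `CompositeIndexKey` / `IndexTree` placeholders defined above.
fn load_resource_by_path(
    trees: &std::collections::BTreeMap<String, IndexTree>,
    virtual_path: &str,
) -> Option<String> {
    let key = CompositeIndexKey(virtual_path.split('/').map(str::to_owned).collect());
    trees
        .get("by-path")
        .and_then(|tree| tree.0.get(&key))
        .cloned()
}
```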
Current State
We still use a file-based implementation, which cannot possibly scale to the extent we need.
Work Items
- [x] (1w) #1625
- [x] #1801
- [ ] #1802
- [ ] For another milestone: Query for children: Pass the index to compilers and have the dependency checker