Content-store indexation integration
At the end of sprint 1, once the content-store indexation work is completed, it should be integrated into the data-pipeline flow.
Summary
This task assumes that #1523 is completed and available in the integration branch.
Now that the content-store library has first-class support for indexation (index trees distributed in the content-store), we need to integrate it into the various parts of our pipeline.
Namely, but perhaps not exhaustively, this means we need to work on:
Source-control integration
Replace the current file-system-based implementation of the lgn-source-control crate so that it uses those indexes. We can start simple, by only adding resources to one index, but we should plan for the future and make supporting multiple indexes easy.
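As a purely illustrative sketch (not the content-store or lgn-source-control API), "plan for multiple indexes" could be as simple as enumerating the index kinds the commit path iterates over, wiring up only the first one initially:

```rust
/// Hypothetical enumeration of the index trees source-control could maintain per
/// commit. Only `ByOid` would be wired up at first; the others are placeholders so
/// that adding them later means extending this list rather than reworking the
/// commit path.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum IndexKind {
    ByOid,
    ByReverseDependency,
    ByVirtualPath,
    ByCoordinates,
}

/// Hypothetical helper: the set of indexes a commit should update. Starting with a
/// single entry keeps the first iteration simple while leaving room for more.
pub fn active_indexes() -> Vec<IndexKind> {
    vec![IndexKind::ByOid]
}
```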
Right now, the source-control assumes that all of its content consists of "files" that each have an on-disk representation in local workspaces. Basically, a workspace is the materialization of a given commit on disk and provides a familiar interface for people. While this is great in terms of familiarity, it hinders our cloud-nativeness by forcing us to download everything to disk and upload it back from disk all the time.
We need to change many things in the source-control crate:
- We need to remove the concept of a disk-based workspace (or perhaps even remove the concept of a workspace altogether?). After all, if everything is cloud-based, the only thing people need in order to commit changes is the root of an index tree. Committing becomes the act of modifying things in that tree and storing back the new tree index identifier (which is guaranteed to differ if there are any changes). See the sketch after this list.
- We used the singular in the previous point by talking about "an index tree", but ideally we should support adding/updating/removing resources in more than one index tree at once. Each index would be used in a different context: storing resources by OID, by reverse dependency, by virtual path (like a filesystem path), by coordinates, and so on.
- Resources need to implement some sort of interface that can return the index keys under which to store them in the various trees. Not all resources need to be stored in all trees: for instance, if a resource does not have coordinates, it makes no sense to add it to an index of resources by coordinates. A given resource could also appear several times in an index: in an index of resources by category, a resource that belongs to several categories would appear once per category. As such, this interface should likely look similar to:

  ```rust
  fn get_index_key_for_index_name(index_name: &str) -> Vec<CompositeIndexKey>
  ```

  where returning an empty `Vec` means: "do not index this resource". A hedged sketch of such an interface follows this list.
- Likewise, the current source-control CLI may need to be heavily adapted (it stays relevant for commands like `create-repository` and such, but checking out files on disk may not make sense anymore, especially as many resources may not even have a "filesystem path" anymore).
- The source-control filesystem CLI may also simply be disabled for now, as it probably doesn't make much sense in that context either.
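To make the above more concrete, here is a minimal, self-contained sketch of the keying interface and of what "committing into multiple index trees" could look like. Everything here (`CompositeIndexKey`, `IndexTree`, `Indexable`, `Texture`, `commit_resource`, the index names) is a hypothetical placeholder with in-memory stand-ins, not the actual lgn-content-store or lgn-source-control API; the trait method also takes `&self`, unlike the bare signature above.

```rust
use std::collections::BTreeMap;

/// Hypothetical composite key: an ordered list of key parts (e.g. category, then id).
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct CompositeIndexKey(Vec<String>);

/// Hypothetical stand-in for a content-store index tree: just an in-memory map from
/// composite key to resource identifier, so the sketch runs on its own.
#[derive(Default, Debug)]
struct IndexTree(BTreeMap<CompositeIndexKey, String>);

/// The interface sketched above: a resource returns the keys under which it should
/// appear in a given index, and an empty `Vec` means "do not index this resource".
trait Indexable {
    fn resource_id(&self) -> String;
    fn get_index_key_for_index_name(&self, index_name: &str) -> Vec<CompositeIndexKey>;
}

/// Hypothetical resource type, used purely for illustration.
struct Texture {
    id: String,
    virtual_path: String,
    categories: Vec<String>,
    coordinates: Option<(i64, i64)>,
}

impl Indexable for Texture {
    fn resource_id(&self) -> String {
        self.id.clone()
    }

    fn get_index_key_for_index_name(&self, index_name: &str) -> Vec<CompositeIndexKey> {
        match index_name {
            // One key made of the path components.
            "by-path" => vec![CompositeIndexKey(
                self.virtual_path.split('/').map(str::to_owned).collect(),
            )],
            // A resource that belongs to several categories appears once per category.
            "by-category" => self
                .categories
                .iter()
                .map(|c| CompositeIndexKey(vec![c.clone(), self.id.clone()]))
                .collect(),
            // No coordinates: returning an empty `Vec` skips this index.
            "by-coordinates" => self
                .coordinates
                .map(|(x, y)| vec![CompositeIndexKey(vec![x.to_string(), y.to_string()])])
                .unwrap_or_default(),
            _ => vec![],
        }
    }
}

/// "Committing" in this sketch is just updating every index tree in place; in the
/// real design this would instead produce new tree root identifiers in the
/// content-store.
fn commit_resource(resource: &dyn Indexable, trees: &mut BTreeMap<String, IndexTree>) {
    for (index_name, tree) in trees.iter_mut() {
        for key in resource.get_index_key_for_index_name(index_name) {
            tree.0.insert(key, resource.resource_id());
        }
    }
}

fn main() {
    let mut trees: BTreeMap<String, IndexTree> = BTreeMap::new();
    trees.insert("by-path".to_owned(), IndexTree::default());
    trees.insert("by-category".to_owned(), IndexTree::default());
    trees.insert("by-coordinates".to_owned(), IndexTree::default());

    let texture = Texture {
        id: "res:1234".to_owned(),
        virtual_path: "world/props/crate.tex".to_owned(),
        categories: vec!["props".to_owned(), "textures".to_owned()],
        coordinates: None, // no coordinates: skipped by the coordinates index.
    };

    commit_resource(&texture, &mut trees);
    println!("{:#?}", trees);
}
```

The in-memory `BTreeMap` only stands in for a content-store index tree so the sketch compiles on its own; the real implementation would read and write index trees in the content-store and hand back the new root identifiers.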
Data-pipeline integration
Following the changes to the source-control, one will also need to update all invocations of the source-control workspaces to use the refreshed concepts. Basically, make it so everything compiles and works after the other changes are submitted :D
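As a tiny illustration of what that shift in callers could look like (again reusing the hypothetical placeholders from the sketch above, not a real data-pipeline API), a pipeline step that used to read a file from a disk workspace would instead resolve the resource through an index:

```rust
/// Hypothetical lookup: instead of materializing a workspace on disk and reading a
/// file, a pipeline step resolves a resource identifier through the "by-path" index
/// tree. Reuses the `CompositeIndexKey` / `IndexTree` placeholders defined above.
fn load_resource_by_path(
    trees: &std::collections::BTreeMap<String, IndexTree>,
    virtual_path: &str,
) -> Option<String> {
    let key = CompositeIndexKey(virtual_path.split('/').map(str::to_owned).collect());
    trees
        .get("by-path")
        .and_then(|tree| tree.0.get(&key))
        .cloned()
}
```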
Current State
We still use a file-based implementation, which cannot possibly scale to the extent we need.
Work Items
- [x] (1w) #1625
- [x] #1801
- [ ] #1802
- [ ] For another milestone: Query for children: Pass the index to compilers and have the dependency checker