forest
forest copied to clipboard
On-demand multi-car indexing
Issue summary
ForestCAR files offer O(1) look-up of IPLD values. Unfortunately, with more than one ForestCAR file, the look-up time grows linearly and quickly becomes prohibitive. One option is to merge the ForestCAR files, but this is expensive. A cheaper option would be to build a new, ephemeral, on-disk index that covers multiple ForestCAR files.
Buckets in a ForestCAR file are stored sorted from low to high. This makes it possible to merge two ForestCAR indexes efficiently and linearly. It's possible to open a ForestCAR file and get a stream of CIDs with the guarantee that the bucket number is growing monotonically.
Tasks:
- [ ] Get a stream of CIDs from ForestCAR.
- [ ] Get the number of CIDs from ForestCAR.
- [ ] Merge multiple CID streams and linearly write out hash-table buckets to disk.
- [ ] Create a new data store that uses a single index to query multiple ForestCAR files.
- [ ] Define property tests to show that the new data store is semantically the same as a
ManyCar. - [ ] Add benchmark command to
forest-toolfor generating stand-alone indexes. - [ ] Measure performance differences when graph-walking all of the diff calibnet snapshots.
Other information and links
Summary from Slack discussion: The tasks outline a proposed solution. Using an off-the-shelf on-disk index would also be acceptable. PairtyDB (or any other on-disk key-value DB) could store an index from CID to CAR frame.