forest icon indicating copy to clipboard operation
forest copied to clipboard

On-demand multi-car indexing

Open lemmih opened this issue 2 years ago • 1 comments

Issue summary

ForestCAR files offer O(1) look-up of IPLD values. Unfortunately, with more than one ForestCAR file, the look-up time grows linearly and quickly becomes prohibitive. One option is to merge the ForestCAR files, but this is expensive. A cheaper option would be to build a new, ephemeral, on-disk index that covers multiple ForestCAR files.

Buckets in a ForestCAR file are stored sorted from low to high. This makes it possible to merge two ForestCAR indexes efficiently and linearly. It's possible to open a ForestCAR file and get a stream of CIDs with the guarantee that the bucket number is growing monotonically.

Tasks:

  • [ ] Get a stream of CIDs from ForestCAR.
  • [ ] Get the number of CIDs from ForestCAR.
  • [ ] Merge multiple CID streams and linearly write out hash-table buckets to disk.
  • [ ] Create a new data store that uses a single index to query multiple ForestCAR files.
  • [ ] Define property tests to show that the new data store is semantically the same as a ManyCar.
  • [ ] Add benchmark command to forest-tool for generating stand-alone indexes.
  • [ ] Measure performance differences when graph-walking all of the diff calibnet snapshots.

Other information and links

lemmih avatar Oct 10 '23 11:10 lemmih

Summary from Slack discussion: The tasks outline a proposed solution. Using an off-the-shelf on-disk index would also be acceptable. PairtyDB (or any other on-disk key-value DB) could store an index from CID to CAR frame.

lemmih avatar Oct 16 '23 11:10 lemmih