chroma icon indicating copy to clipboard operation
chroma copied to clipboard

[ENH][SEC]: Move HNSW Index metadata (pickle) to protobuf binary file

Open tazarov opened this issue 1 year ago • 4 comments

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • We persist the HNSW max_seq_id when the index is written; this will help with WAL cleanup and will not require loading the HNSW index before cleanup.

Test plan

How are these changes tested?

  • [x] Tests pass locally with pytest for python

Documentation Changes

N/A

tazarov avatar Mar 13 '24 17:03 tazarov

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • [ ] Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • [ ] Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • [ ] If appropriate, are there adequate property based tests?
  • [ ] If appropriate, are there adequate unit tests?
  • [ ] Should any logging, debugging, tracing information be added or removed?
  • [ ] Are error messages user-friendly?
  • [ ] Have all documentation changes needed been made?
  • [ ] Have all non-obvious changes been commented?

System Compatibility

  • [ ] Are there any potential impacts on other parts of the system or backward compatibility?
  • [ ] Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • [ ] Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

github-actions[bot] avatar Mar 13 '24 17:03 github-actions[bot]

This is already persisted as part of the "PersistentData" in the persistent hnsw segment isn't it?

HammadB avatar Mar 13 '24 18:03 HammadB

that is true, but we do have to load the metadata to find out the max_seq_id, so I thought, why not keep it in the WAL max_seq_id table?

I think you even have a comment in the HNSW index to move the metadata in sqlite.

tazarov avatar Mar 13 '24 19:03 tazarov

@HammadB, have a look. I think this should work.

The pickle file metadata is now moved to segment metadata, including the dict mappings between id - label - seqId (dumped as json tuple). Performance is on par with pickle dumps/loads.

WIP - tests for migration from pickle file (it works but it has to be tested)

tazarov avatar Mar 26 '24 14:03 tazarov