llama_index
llama_index copied to clipboard
Integrating AWS DocumentDB as a vector storage method
Description
This change adds a vector store integration for AWS Document DB. @vidur2 and I saw an issue opened to ask for this integration in the LangChain repository and believed that LLamaIndex users would find it useful as well given that we were using it for a project. There are no new dependencies, this integration just requires the pymongo library.
New Package?
Did I fill in the tool.llamahub
section in the pyproject.toml
and provide a detailed README.md for my new integration or package?
- [x] Yes
- [ ] No
Version Bump?
Did I bump the version in the pyproject.toml
file of the package I am updating? (Except for the llama-index-core
package)
- [x] Yes
- [ ] No
Type of Change
Please delete options that are not relevant.
- [x] New feature (non-breaking change which adds functionality)
- [x] This change requires a documentation update
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
- [x] Added new unit/integration tests
- [] Added new notebook (that tests end-to-end)
- [x] I stared at the code and made sure it makes sense
Suggested Checklist:
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [ ] I have added Google Colab support for the newly added notebooks.
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I ran
make format; make lint
to appease the lint gods
Our unit tests require one to input their own AWS credentials to run. In order to avoid exposing AWS secrets, we didn't include ours as part of the PR. Is there some way we can get approved without having these tests pass?
The tests should mock out API requests, or should be skipped properly if creds are missing
The tests should mock out API requests, or should be skipped properly if creds are missing Should be fixed.
Check out this pull request onΒ
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
@logan-markewich is there anything else we need to add so that this can be merged? thanks for guiding us through the process
@prajwalsaokar the tests were a little wonky -- we need to mark them as skip (which is what I did), if we aren't going to mock the API calls. Also cleaned up a few other things, should be good to go now
@prajwalsaokar ngl the tests are pretty borked -- do you mind taking a look and ensuring either a) things are properly mocked (api calls, etc.) b) things are properly skipped when not mocked and things are missing c) that tests actually work (I was hitting all kinds of errors actually haha)
@prajwalsaokar ngl the tests are pretty borked -- do you mind taking a look and ensuring either a) things are properly mocked (api calls, etc.) b) things are properly skipped when not mocked and things are missing c) that tests actually work (I was hitting all kinds of errors actually haha)
We just tried the tests on two machines, they should work it's just a bit of hassle to set them up because you need a DocumentDB instance and an EC2 instance thats connected to the DB instance with the correct networking config for the tests to run properly. Could you send over the errors you're getting?
We also think we implemented skipping correctly, but we didn't come up with a way to skip the tests if the EC2 configuration isn't correct since that's not visible to LlamaIndex. Can you clarify what changes you think we should make on the skip/mock front?
It looks like we should follow something like Cosmos or Postgres. They simply try to connect to DB and if there is a failure, they set a "*_not_available" variable which is used to skip the test. From a quick glance at tests written for a few vector stores, it doesn't look like mocking the API calls is typically done. I'll leave it up to Logan to provide requirements for the merge, but just wanted to leave this comment first.
Correct, usually you can do some sort of check and based on the condition, skip or not.
Mocking would allow the tests to actually run, in cicd, but it's usually more work (hence why other vectordbs don't have it)
I'll leave it up to you since this is your integration ππ» either is fine by me
Once cicd passes here I will merge
Seems there is an incorrect import, maybe take a double check
For example, should be from llama_index.core.schema import...
Tests are still failing with the same import error π Does these actually run locally for you? I'd be very surprised if they did. I'll push a fix
Thanks Logan for the help. Thank you Prajwal and Vidur for your contribution. I will try this out soon.
Thank you so much for all the help Logan, this was my first open source contribution and I learned a lot throughout this process. I'm very grateful for the time you spent ensuring that this integration could go through.
@prajwalsaokar thank you for the integration, seriously saved me a ton of work. Quick question though: I noticed you added the pre-filter in your code, but I couldn't find anything about pre-filtering in DocumentDB docs for vector search. Did you check if it is working on DocDB? Thanks again!