llama_index icon indicating copy to clipboard operation
llama_index copied to clipboard

Integrating AWS DocumentDB as a vector storage method

Open prajwalsaokar opened this issue 10 months ago β€’ 2 comments

Description

This change adds a vector store integration for AWS Document DB. @vidur2 and I saw an issue opened to ask for this integration in the LangChain repository and believed that LLamaIndex users would find it useful as well given that we were using it for a project. There are no new dependencies, this integration just requires the pymongo library.

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • [x] Yes
  • [ ] No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • [x] Yes
  • [ ] No

Type of Change

Please delete options that are not relevant.

  • [x] New feature (non-breaking change which adds functionality)
  • [x] This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • [x] Added new unit/integration tests
  • [] Added new notebook (that tests end-to-end)
  • [x] I stared at the code and made sure it makes sense

Suggested Checklist:

  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [ ] I have added Google Colab support for the newly added notebooks.
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
  • [x] I ran make format; make lint to appease the lint gods

prajwalsaokar avatar Mar 24 '24 20:03 prajwalsaokar

Our unit tests require one to input their own AWS credentials to run. In order to avoid exposing AWS secrets, we didn't include ours as part of the PR. Is there some way we can get approved without having these tests pass?

vidur2 avatar Mar 26 '24 03:03 vidur2

The tests should mock out API requests, or should be skipped properly if creds are missing

logan-markewich avatar Mar 29 '24 15:03 logan-markewich

The tests should mock out API requests, or should be skipped properly if creds are missing Should be fixed.

vidur2 avatar Apr 01 '24 22:04 vidur2

Check out this pull request onΒ  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@logan-markewich is there anything else we need to add so that this can be merged? thanks for guiding us through the process

prajwalsaokar avatar Apr 13 '24 08:04 prajwalsaokar

@prajwalsaokar the tests were a little wonky -- we need to mark them as skip (which is what I did), if we aren't going to mock the API calls. Also cleaned up a few other things, should be good to go now

logan-markewich avatar Apr 15 '24 16:04 logan-markewich

@prajwalsaokar ngl the tests are pretty borked -- do you mind taking a look and ensuring either a) things are properly mocked (api calls, etc.) b) things are properly skipped when not mocked and things are missing c) that tests actually work (I was hitting all kinds of errors actually haha)

logan-markewich avatar Apr 15 '24 17:04 logan-markewich

@prajwalsaokar ngl the tests are pretty borked -- do you mind taking a look and ensuring either a) things are properly mocked (api calls, etc.) b) things are properly skipped when not mocked and things are missing c) that tests actually work (I was hitting all kinds of errors actually haha)

We just tried the tests on two machines, they should work it's just a bit of hassle to set them up because you need a DocumentDB instance and an EC2 instance thats connected to the DB instance with the correct networking config for the tests to run properly. Could you send over the errors you're getting?

prajwalsaokar avatar Apr 15 '24 21:04 prajwalsaokar

We also think we implemented skipping correctly, but we didn't come up with a way to skip the tests if the EC2 configuration isn't correct since that's not visible to LlamaIndex. Can you clarify what changes you think we should make on the skip/mock front?

prajwalsaokar avatar Apr 15 '24 22:04 prajwalsaokar

It looks like we should follow something like Cosmos or Postgres. They simply try to connect to DB and if there is a failure, they set a "*_not_available" variable which is used to skip the test. From a quick glance at tests written for a few vector stores, it doesn't look like mocking the API calls is typically done. I'll leave it up to Logan to provide requirements for the merge, but just wanted to leave this comment first.

samkhano1 avatar Apr 19 '24 17:04 samkhano1

Correct, usually you can do some sort of check and based on the condition, skip or not.

Mocking would allow the tests to actually run, in cicd, but it's usually more work (hence why other vectordbs don't have it)

I'll leave it up to you since this is your integration πŸ‘πŸ» either is fine by me

Once cicd passes here I will merge

logan-markewich avatar Apr 19 '24 18:04 logan-markewich

Seems there is an incorrect import, maybe take a double check

For example, should be from llama_index.core.schema import...

logan-markewich avatar Apr 20 '24 03:04 logan-markewich

Tests are still failing with the same import error πŸ˜“ Does these actually run locally for you? I'd be very surprised if they did. I'll push a fix

logan-markewich avatar Apr 22 '24 20:04 logan-markewich

Thanks Logan for the help. Thank you Prajwal and Vidur for your contribution. I will try this out soon.

ip9inderpreet avatar Apr 24 '24 19:04 ip9inderpreet

Thank you so much for all the help Logan, this was my first open source contribution and I learned a lot throughout this process. I'm very grateful for the time you spent ensuring that this integration could go through.

prajwalsaokar avatar Apr 26 '24 00:04 prajwalsaokar

@prajwalsaokar thank you for the integration, seriously saved me a ton of work. Quick question though: I noticed you added the pre-filter in your code, but I couldn't find anything about pre-filtering in DocumentDB docs for vector search. Did you check if it is working on DocDB? Thanks again!

jpcoutinho avatar May 06 '24 08:05 jpcoutinho