
PGVector Support for Custom Connection Object

Knucklessg1 opened this pull request 1 year ago

Why are these changes needed?

This PR adds support for custom psycopg connection objects.

A user can now define the connection object themselves.

This is important because a connection object may need heavy customization for certain environments; we should allow the end user to supply a connection object suited to their environment.

A fix is included for .gitattributes so that certain files are committed with LF line endings instead of CRLF. (CRLF endings were breaking bash scripts in the repo.)
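
As a rough sketch of the kind of .gitattributes entries this refers to (the exact file patterns are assumptions, not taken from this PR):

# Commit these file types with LF endings regardless of the platform default
*.sh text eol=lf
*.py text eol=lf
*.yml text eol=lf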

A fix is also included so that the psycopg[binary] dependency is installed on Windows and Mac, while Linux can use the pure-Python psycopg implementation.

import os
import urllib.parse

import psycopg

import autogen
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

# Placeholder connection details -- substitute values for your environment.
# URL-encoding the password handles special characters in the conninfo string.
password = urllib.parse.quote("your-password")
connection_string_encoded = f"postgresql://postgres:{password}@localhost:5432/postgres"
conn = psycopg.connect(conninfo=connection_string_encoded, autocommit=True)

# Assumes an OAI_CONFIG_LIST file or env var with your LLM configuration.
config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    retrieve_config={
        "task": "code",
        "docs_path": [
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md",
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md",
            os.path.join(os.path.abspath(""), "..", "website", "docs"),
        ],
        "custom_text_types": ["non-existent-type"],
        "chunk_token_size": 2000,
        "model": config_list[0]["model"],
        "vector_db": "pgvector",  # PGVector database
        "db_config": {
            "conn": conn,  # pass the custom psycopg connection object
        },
        "get_or_create": True,  # set to False if you don't want to reuse an existing collection
        "overwrite": False,  # set to True if you want to overwrite an existing collection
    },
    code_execution_config=False,  # set to False if you don't want to execute the code
)

The connection object is then passed into db_config for the retrieve agent, as shown above.

This PR also fixes psycopg.connect() to use the username field directly.

Related issue number

NA

Checks

  • [x] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
  • [x] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [x] I've made sure all auto checks have passed.
  • [x] I have tested all 3 forms of authentication using a new PGVector docker image (see the sketch below).
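
For anyone reproducing the test setup, a hypothetical one-liner for standing up such an image locally (the image tag, container name, and credentials are assumptions, not taken from this PR):

docker run --name pgvector-test -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d pgvector/pgvector:pg16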

Knucklessg1 avatar May 01 '24 21:05 Knucklessg1

⚠️ GitGuardian has uncovered 8 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id   GitGuardian status   Secret             Commit                                     Filename
10493810         Triggered            Generic Password   8d19f65bccb3e45db4c0945f7cd8fad07eab2b6a   test/agentchat/contrib/vectordb/test_pgvectordb.py
10493810         Triggered            Generic Password   4b7ba2bc336415aead979598247c6fd5ab7e2689   notebook/agentchat_pgvector_RetrieveChat.ipynb
10493810         Triggered            Generic Password   4b7ba2bc336415aead979598247c6fd5ab7e2689   notebook/agentchat_pgvector_RetrieveChat.ipynb
10493810         Triggered            Generic Password   4b7ba2bc336415aead979598247c6fd5ab7e2689   notebook/agentchat_pgvector_RetrieveChat.ipynb
10493810         Triggered            Generic Password   fdbc3d5988b6f266ee5f96ba8c1cc8f23c8ee6e6   test/agentchat/contrib/vectordb/test_pgvectordb.py
10493810         Triggered            Generic Password   6e91d73def823664f1397da8692c2943bd9dd8c7   notebook/agentchat_pgvector_RetrieveChat.ipynb
10493810         Triggered            Generic Password   10e2c2e7cbb8089ebcd61e5c780074269691da69   notebook/agentchat_pgvector_RetrieveChat.ipynb
10493810         Triggered            Generic Password   10e2c2e7cbb8089ebcd61e5c780074269691da69   notebook/agentchat_pgvector_RetrieveChat.ipynb
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

gitguardian[bot] avatar May 01 '24 21:05 gitguardian[bot]

Codecov Report

Attention: Patch coverage is 0%, with 50 lines in your changes missing coverage. Please review.

Project coverage is 12.14%. Comparing base (11d9336) to head (94b7fbb). Report is 17 commits behind head on main.

Files                                               Patch %   Lines
autogen/agentchat/contrib/vectordb/pgvectordb.py    0.00%     43 Missing :warning:
setup.py                                            0.00%     7 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2566       +/-   ##
===========================================
- Coverage   33.60%   12.14%   -21.46%     
===========================================
  Files          87       87               
  Lines        9336     9417       +81     
  Branches     1987     2010       +23     
===========================================
- Hits         3137     1144     -1993     
- Misses       5933     8260     +2327     
+ Partials      266       13      -253     
Flag        Coverage Δ
unittests   12.14% <0.00%> (-21.46%) :arrow_down:

Flags with carried forward coverage won't be shown.

:umbrella: View full report in Codecov by Sentry.

codecov-commenter avatar May 04 '24 15:05 codecov-commenter

I just made two additional changes. First, I added a few file types to the .gitattributes. Many files were being committed with CRLF line endings, which was breaking bash scripts within the repo (and also breaking mypy). I added specific attributes to those file types to ensure they are committed with LF line endings.

Additionally, setup.py was updated to include psycopg[binary] for Windows and Mac, while Ubuntu uses the regular psycopg requirement. This is because Windows does not ship the libpq5 dependency. On Ubuntu, it can be installed by running sudo apt install libpq5 -y, which lets the pure-Python implementation of psycopg work.
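
A minimal sketch of how such a platform-conditional requirement can be expressed with PEP 508 environment markers in setup.py (the package versions, package name, and extra name are assumptions, not taken from this PR):

# setup.py (sketch): psycopg[binary] on Windows/Mac, pure-Python psycopg on Linux
from setuptools import setup

setup(
    name="example-package",  # hypothetical package name
    extras_require={
        "retrievechat-pgvector": [
            # binary wheels bundle libpq, needed where libpq5 isn't installed
            'psycopg[binary]>=3.1; platform_system == "Windows" or platform_system == "Darwin"',
            # pure-Python implementation; relies on the system libpq5
            'psycopg>=3.1; platform_system == "Linux"',
        ],
    },
)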

Knucklessg1 avatar May 08 '24 15:05 Knucklessg1

@Knucklessg1 some tests failed. RetrieveChat tests may be related to your code change.

Message: 'Error connecting to the database: ' Arguments: (OperationalError('connection is bad: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory\n\tIs the server running locally and accepting connections on that socket?'),)

thinkall avatar May 22 '24 10:05 thinkall

> @Knucklessg1 some tests failed. RetrieveChat tests may be related to your code change.
>
> Message: 'Error connecting to the database: ' Arguments: (OperationalError('connection is bad: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory\n\tIs the server running locally and accepting connections on that socket?'),)

I just pushed the latest commit. I was able to validate all three authentication methods in the notebook. I was unable to push an updated notebook because I do not have an LLM environment at the moment; the notebook was tested up to the point where the collections are created for authentication.

Knucklessg1 avatar May 22 '24 17:05 Knucklessg1

Hi @Knucklessg1, thanks for this awesome added feature!

Not sure if this is the right place to ask, but I would appreciate any help. Is chunk_token_size being used to split docs when pgvector is the vector database? I don't see code that splits based on chunk_token_size (as in normal usage for local files); instead it appears to split on the model's max token count by default for each doc (link), which means that full local docs/files would be added to the vectordb and fed into the context directly based on vector distance.

chenyanbiao avatar May 31 '24 05:05 chenyanbiao

@chenyanbiao did you take a look at retrieve_utils.py?

That's where the logic for the split happens, and the docs are split the same way regardless of the vectordb backend.
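
For reference, a minimal sketch of exercising the splitter directly; the split_text_to_chunks signature is assumed from autogen's retrieve_utils and may differ across versions:

# Sketch: splitting text with autogen's retrieve_utils (signature assumed)
from autogen.retrieve_utils import split_text_to_chunks

text = "some long document text. " * 500
# max_tokens caps the token count of each chunk
chunks = split_text_to_chunks(text, max_tokens=2000, chunk_mode="multi_lines")
print(f"{len(chunks)} chunks produced")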

Knucklessg1 avatar May 31 '24 21:05 Knucklessg1

@Knucklessg1 Thanks for the response. Yes, that is what I am looking at. I understand that both paths use the same splitting logic. My confusion is that the non-vectordb path passes the chunk_token_size parameter (link) while the vectordb path passes the model's max_token to the split function (link), which is inconsistent.

chenyanbiao avatar Jun 01 '24 15:06 chenyanbiao

@thinkall do you have any thoughts around this?

Knucklessg1 avatar Jun 04 '24 16:06 Knucklessg1