PGVector Support for Custom Connection Object
Why are these changes needed?
This PR adds support for custom psycopg connections: the user can define and pass in their own connection object.
This is important because the connection object may need to be highly customized for certain environments, so the end user should be able to supply one that fits their setup.
A fix is included for .gitattributes so that certain files are committed with LF line endings instead of CRLF (CRLF endings were breaking bash scripts in the repo).
A fix is also included so that the psycopg[binary] dependency is installed on Windows and macOS, while Linux can use the pure Python psycopg implementation.
```python
import os

import psycopg

from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

# connection_string_encoded and config_list are assumed to be defined earlier
# (e.g., in the notebook's setup cells).
conn = psycopg.connect(conninfo=connection_string_encoded, autocommit=True)

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    retrieve_config={
        "task": "code",
        "docs_path": [
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md",
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md",
            os.path.join(os.path.abspath(""), "..", "website", "docs"),
        ],
        "custom_text_types": ["non-existent-type"],
        "chunk_token_size": 2000,
        "model": config_list[0]["model"],
        "vector_db": "pgvector",  # PGVector database
        "db_config": {
            "conn": conn,  # pass the custom connection object here
        },
        "get_or_create": True,  # set to False if you don't want to reuse an existing collection
        "overwrite": False,  # set to True if you want to overwrite an existing collection
    },
    code_execution_config=False,  # set to False if you don't want to execute the code
)
```
The connection object is then passed into the `db_config` of the retrieve agent, as shown above.
This PR also contains a fix so that psycopg.connect() uses the username field directly.
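For illustration, here is a minimal sketch of building the connection from discrete fields rather than a conninfo string; the host, database, and credential values below are placeholders, not values from this PR:

```python
import psycopg

# Placeholder connection parameters for illustration only; use your own environment's values.
conn = psycopg.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="example-password",
    autocommit=True,
)
```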
Related issue number
NA
Checks
- [x] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [x] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [x] I've made sure all auto checks have passed.
- [x] I have tested all 3 forms of authentication using a new PGVector docker image.
⚠️ GitGuardian has uncovered 8 secrets following the scan of your pull request.
Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.
🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 10493810 | Triggered | Generic Password | 8d19f65bccb3e45db4c0945f7cd8fad07eab2b6a | test/agentchat/contrib/vectordb/test_pgvectordb.py |
| 10493810 | Triggered | Generic Password | 4b7ba2bc336415aead979598247c6fd5ab7e2689 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 4b7ba2bc336415aead979598247c6fd5ab7e2689 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 4b7ba2bc336415aead979598247c6fd5ab7e2689 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | fdbc3d5988b6f266ee5f96ba8c1cc8f23c8ee6e6 | test/agentchat/contrib/vectordb/test_pgvectordb.py |
| 10493810 | Triggered | Generic Password | 6e91d73def823664f1397da8692c2943bd9dd8c7 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 10e2c2e7cbb8089ebcd61e5c780074269691da69 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 10e2c2e7cbb8089ebcd61e5c780074269691da69 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely, following best practices for secret management.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation
Codecov Report
Attention: Patch coverage is 0%, with 50 lines in your changes missing coverage. Please review.
Project coverage is 12.14%. Comparing base (11d9336) to head (94b7fbb). Report is 17 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| autogen/agentchat/contrib/vectordb/pgvectordb.py | 0.00% | 43 Missing :warning: |
| setup.py | 0.00% | 7 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2566       +/-  ##
===========================================
- Coverage   33.60%   12.14%   -21.46%
===========================================
  Files          87       87
  Lines        9336     9417       +81
  Branches     1987     2010       +23
===========================================
- Hits         3137     1144     -1993
- Misses       5933     8260     +2327
+ Partials      266       13      -253
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 12.14% <0.00%> (-21.46%) :arrow_down: | |
Flags with carried forward coverage won't be shown.
I just made two additional changes. First, I added a few file types to .gitattributes: many files were being committed with CRLF line endings, which was breaking bash scripts in the repo (and mypy), so I added attributes for those file types to ensure they are committed with LF line endings.
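As a rough sketch (the exact file types covered in the PR may differ), the added .gitattributes entries look something like this:

```
# Illustrative only: commit and check out these file types with LF line endings
*.sh    text eol=lf
*.py    text eol=lf
*.ipynb text eol=lf
```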
Second, setup.py was updated so that Windows and macOS install psycopg[binary], while Ubuntu uses the regular psycopg requirement. This is because Windows does not ship the libpq5 dependency; on Ubuntu it can be installed with `sudo apt install libpq5 -y`, which allows the pure Python psycopg implementation to be used.
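A minimal sketch of how such a platform-conditional requirement can be expressed in setup.py (the variable name and version pins are assumptions, not necessarily what the PR uses):

```python
import sys

# Illustrative only: pick the psycopg flavor based on the platform.
retrieve_chat_pgvector = ["pgvector>=0.2.5"]

if sys.platform.startswith("win") or sys.platform.startswith("darwin"):
    # Windows/macOS: the binary wheels bundle libpq, so no system package is needed.
    retrieve_chat_pgvector.append("psycopg[binary]>=3.1.18")
else:
    # Linux: plain psycopg, relying on the system-provided libpq (e.g. `apt install libpq5`).
    retrieve_chat_pgvector.append("psycopg>=3.1.18")
```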
@Knucklessg1 some tests failed. RetrieveChat tests may be related to your code change.
Message: 'Error connecting to the database: ' Arguments: (OperationalError('connection is bad: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory\n\tIs the server running locally and accepting connections on that socket?'),)
I just pushed the latest commit. I was able to validate authentication in the notebook for all three authentication methods. I was unable to run the full notebook because I do not have an LLM environment at the moment; it was tested up to the point where the collections are created for each authentication method.
Hi @Knucklessg1 thanks for this awesome added feature!
Not sure if this is the right place to ask, but I would appreciate any help. Is chunk_token_size actually used to split docs when pgvector is the vector database? I don't see code where the split is based on chunk_token_size (as in the normal usage for local files); instead it defaults to the model's max token count for each doc (link), which means the full local docs/files will be added to the vectordb and fed into the context directly based on vector distance.
@chenyanbiao did you take a look at retrieve_utils.py? That is where the splitting logic lives; documents are split the same way regardless of the vectordb backend.
@Knucklessg1 Thanks for the response. Yes, that is what I am looking at. I understand that both paths use the same splitting logic. My confusion is that the non-vectordb path passes the chunk_token_size parameter to the split function (link), while the vectordb path passes max_token instead (link), which is inconsistent.
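To make the described inconsistency concrete, here is a purely hypothetical sketch (not the actual autogen code; the splitter and all names here are stand-ins) of the two call sites:

```python
# Hypothetical illustration only -- not the actual autogen implementation.

def split_text_to_chunks(text: str, max_tokens: int) -> list[str]:
    """Stand-in splitter: cut `text` into pieces of at most `max_tokens` words."""
    words = text.split()
    return [" ".join(words[i : i + max_tokens]) for i in range(0, len(words), max_tokens)]

text = "some long document " * 1000

# Non-vectordb path: chunk size driven by the user-supplied chunk_token_size.
chunk_token_size = 2000
chunks_non_vectordb = split_text_to_chunks(text, max_tokens=chunk_token_size)

# Vectordb path (as described above): chunk size falls back to the model's max token limit.
model_max_tokens = 128000
chunks_vectordb = split_text_to_chunks(text, max_tokens=model_max_tokens)

# With a large model limit, each doc may end up as a single huge chunk.
print(len(chunks_non_vectordb), len(chunks_vectordb))
```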
@thinkall do you have any thoughts around this?