[DRAFT] GraphRAG integration
Why are these changes needed?
Related issue number
Checks
- [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
Hi @lspinheiro - this is exciting. It's marked as DRAFT in the subject line but not marked as such in the PR, so I'm converting it to a draft; please set it back by clicking "Ready for review" when you are ready.
Exciting to see this!! I love the tool idea. The tool itself can also be stateful and shared by multiple agents.
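For illustration only, a rough sketch of what sharing one stateful tool instance between two agents could look like, assuming the AgentChat `AssistantAgent` API and reusing `openai_client` and `global_config` from the test script below; the agent names are made up:

```python
# Hypothetical sketch: one GlobalSearchTool instance shared by two agents, so any
# state the tool keeps internally is reused across both. `openai_client` and
# `global_config` are assumed to be set up as in the test script below.
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.tools.graphrag import GlobalSearchTool

shared_tool = GlobalSearchTool.from_config(openai_client=openai_client, data_config=global_config)

researcher = AssistantAgent(name="researcher", model_client=openai_client, tools=[shared_tool])
reviewer = AssistantAgent(name="reviewer", model_client=openai_client, tools=[shared_tool])
```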
Thanks @ekzhu and @rysweet. This should be ready for review now. It still needs the improvements mentioned in the description, but the tools are usable. I used the following test script:
```python
import asyncio

from autogen_core import CancellationToken
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient
from autogen_ext.tools.graphrag import (
    GlobalSearchTool,
    LocalSearchTool,
    GlobalDataConfig,
    LocalDataConfig,
    EmbeddingConfig,
)
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


async def main():
    openai_client = AzureOpenAIChatCompletionClient(
        model="gpt-4o-mini",
        azure_endpoint="https://<resource-name>.openai.azure.com",
        azure_deployment="gpt-4o-mini",
        api_version="2024-08-01-preview",
        azure_ad_token_provider=get_bearer_token_provider(
            DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
        ),
    )

    # Global search example
    global_config = GlobalDataConfig(input_dir="./autogen-test/ragtest/output")
    global_tool = GlobalSearchTool.from_config(openai_client=openai_client, data_config=global_config)

    global_args = {"query": "What does the station-master says about Dr. Becher?"}
    global_result = await global_tool.run_json(global_args, CancellationToken())
    print("\nGlobal Search Result:")
    print(global_result)

    # Local search example
    local_config = LocalDataConfig(input_dir="./autogen-test/ragtest/output")
    embedding_config = EmbeddingConfig(
        model="text-embedding-3-small",
        api_base="https://<resource-name>.openai.azure.com",
        deployment_name="text-embedding-3-small",
        api_version="2023-05-15",
        api_type="azure",
        azure_ad_token_provider=get_bearer_token_provider(
            DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
        ),
        max_retries=10,
        request_timeout=180.0,
    )
    local_tool = LocalSearchTool.from_config(
        openai_client=openai_client,
        data_config=local_config,
        embedding_config=embedding_config,
    )

    local_args = {"query": "What does the station-master says about Dr. Becher?"}
    local_result = await local_tool.run_json(local_args, CancellationToken())
    print("\nLocal Search Result:")
    print(local_result)


if __name__ == "__main__":
    asyncio.run(main())
```
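(As a usage note: the `<resource-name>` placeholders need to point at your own Azure OpenAI resource, and `input_dir` points at the output directory of an existing GraphRAG index, i.e. the parquet and vector DB files it produces.)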
@jackgerrits, I had to add override-dependencies for pydantic and tenacity: the current pinned version of pydantic is below their minimum requirement, and there is a conflict with llamaindex, which requires a lower version of tenacity but is only a dev dependency for us. Let me know if you have any concerns with this approach.
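For reference, a rough sketch of what such overrides can look like in a uv `pyproject.toml`; the version bounds below are placeholders, not the exact ones used in this PR:

```toml
# Hypothetical sketch only; actual package names/bounds in the PR may differ.
[tool.uv]
override-dependencies = [
    "pydantic>=2.10",  # graphrag needs a newer pydantic than the current lock
    "tenacity>=9.0",   # llamaindex (dev-only) pins an older tenacity
]
```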
Thank you! More documentation would help me review this PR. I would like to be able to build the docs page on this PR and see the example.
Related #4438
@gagb, I added a sample with a README and some docstrings that should help with the review.
Codecov Report
Attention: Patch coverage is 92.61745% with 11 lines in your changes missing coverage. Please review.
Project coverage is 69.40%. Comparing base (8efe0c4) to head (b372551). Report is 1 commit behind head on main.
Additional details and impacted files
```text
@@            Coverage Diff             @@
##             main    #4612      +/-  ##
==========================================
+ Coverage   69.07%   69.40%   +0.33%
==========================================
  Files         159      163       +4
  Lines       10346    10495     +149
==========================================
+ Hits         7146     7284     +138
- Misses       3200     3211      +11
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 69.40% <92.61%> (+0.33%) | :arrow_up: |
Let's add some unit tests. See the code coverage result. Is it possible to run a simple setup procedure with a mini data set, perhaps a generated one?
How much data do you think it is OK to add? I think the Sherlock Holmes book generates roughly 10 MB of data between the parquet and vector DB files. I can try to look into something smaller, but I don't know how to estimate GraphRAG's output size from the input, so it's hard to say how much test data I would need to store in the repo.
How about a text file with 10 sentences? What is the size of the index?
Is there a mirror we can fetch it from instead of including it in the repo?
@ekzhu @jackgerrits, I added the data in a conftest file. Since we are mocking the LLM calls, it won't matter as much.
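For context, a rough sketch of the kind of fixture such a conftest can provide; the file name and columns below are illustrative rather than the real GraphRAG index schema, and the actual tests also mock the model client:

```python
# Hypothetical conftest.py sketch: build a tiny fake index table in tmp_path so
# tests don't depend on a large checked-in dataset. Requires pandas + pyarrow.
# Column names and the parquet file name are illustrative only.
import pandas as pd
import pytest


@pytest.fixture
def mini_index_dir(tmp_path):
    entities = pd.DataFrame(
        {
            "id": ["e1"],
            "title": ["DR. BECHER"],
            "description": ["An entity mentioned by the station-master."],
        }
    )
    entities.to_parquet(tmp_path / "create_final_entities.parquet")
    return tmp_path
```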
@ekzhu, CI and docstrings are fixed. I also updated all examples to use the component config, but didn't create a copy in the GraphRAG sample to avoid a lot of duplicate code; I added a comment referring to the Chainlit template instead.
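For reviewers, a rough sketch of the component-config round trip the updated examples use, assuming the GraphRAG tools expose the same `dump_component` / `load_component` interface as other `autogen_ext` components:

```python
# Hypothetical sketch; assumes GlobalSearchTool implements the standard component
# serialization, and reuses `global_tool` from the test script above.
tool_config = global_tool.dump_component()                    # serialize to a component model
restored_tool = GlobalSearchTool.load_component(tool_config)  # rebuild the tool from config
```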
I had to add some dependency overrides because of conflicts between Chainlit and GraphRAG; I will create an issue to see if we can resolve this with better-defined dependency-groups. CC @jackgerrits
@ekzhu, I have updated the PR based on the feedback.