
[DRAFT] Graphrag integration

lspinheiro opened this pull request

Why are these changes needed?

Related issue number

Checks

  • [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
  • [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [ ] I've made sure all auto checks have passed.

lspinheiro avatar Dec 09 '24 07:12 lspinheiro

hi @lspinheiro - this is exciting. It's also marked as DRAFT in the subject line but not marked as such in the PR, so I'm marking it as a draft; please set it back by clicking Ready for review when you are ready.

rysweet avatar Dec 10 '24 17:12 rysweet

Exciting to see this!! I love the tool idea. The tool itself can also be stateful and shared by multiple agents.
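A minimal sketch of what that sharing could look like, assuming the AgentChat AssistantAgent API; the client and config objects are built as in the test script later in this thread:

# Hedged sketch: one stateful tool instance shared by two agents.
# `openai_client`, `local_config`, and `embedding_config` are constructed
# exactly as in the test script further down in this thread.
from autogen_agentchat.agents import AssistantAgent

shared_tool = LocalSearchTool.from_config(
    openai_client=openai_client,
    data_config=local_config,
    embedding_config=embedding_config,
)

# Both agents hold a reference to the same tool object, so any state it
# keeps (loaded index, caches) is shared between them.
researcher = AssistantAgent("researcher", model_client=openai_client, tools=[shared_tool])
writer = AssistantAgent("writer", model_client=openai_client, tools=[shared_tool])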

ekzhu avatar Dec 12 '24 01:12 ekzhu

Thanks @ekzhu and @rysweet. This should be ready for review now. It still needs the improvements mentioned in the description, but the tools are usable. I used the following test script:

import asyncio
from autogen_core import CancellationToken
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient
from autogen_ext.tools.graphrag import (
    GlobalSearchTool,
    LocalSearchTool,
    GlobalDataConfig,
    LocalDataConfig,
    EmbeddingConfig,
)
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


async def main():
    # Azure OpenAI client shared by both search tools; the endpoint and
    # deployment names are placeholders.
    openai_client = AzureOpenAIChatCompletionClient(
        model="gpt-4o-mini",
        azure_endpoint="https://<resource-name>.openai.azure.com",
        azure_deployment="gpt-4o-mini",
        api_version="2024-08-01-preview",
        azure_ad_token_provider=get_bearer_token_provider(
            DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
        ),
    )

    # Global search example
    global_config = GlobalDataConfig(
        input_dir="./autogen-test/ragtest/output"
    )
    
    global_tool = GlobalSearchTool.from_config(
        openai_client=openai_client,
        data_config=global_config
    )

    global_args = {
        "query": "What does the station-master say about Dr. Becher?"
    }

    global_result = await global_tool.run_json(global_args, CancellationToken())
    print("\nGlobal Search Result:")
    print(global_result)
    
    # Local search example
    local_config = LocalDataConfig(
        input_dir="./autogen-test/ragtest/output"
    )

    # Embedding model config used by local search for vector retrieval.
    embedding_config = EmbeddingConfig(
        model="text-embedding-3-small",
        api_base="https://<resource-name>.openai.azure.com",
        deployment_name="text-embedding-3-small",
        api_version="2023-05-15",
        api_type="azure",
        azure_ad_token_provider=get_bearer_token_provider(
            DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
        ),
        max_retries=10,
        request_timeout=180.0,
    )

    local_tool = LocalSearchTool.from_config(
        openai_client=openai_client,
        data_config=local_config,
        embedding_config=embedding_config
    )

    local_args = {
        "query": "What does the station-master say about Dr. Becher?"
    }

    local_result = await local_tool.run_json(local_args, CancellationToken())
    print("\nLocal Search Result:")
    print(local_result)


if __name__ == "__main__":
    asyncio.run(main())
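
Note: the script assumes a GraphRAG index has already been built under ./autogen-test/ragtest/output (for example, with the graphrag indexing CLI); the endpoint and deployment names are placeholders.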

lspinheiro avatar Dec 17 '24 06:12 lspinheiro

@jackgerrits, I had to add override-dependencies for pydantic and tenacity: the current version of pydantic is below their minimum requirement, and there is a conflict with llamaindex, which requires a lower version of tenacity but is only a dev dependency for us. Let me know if you have any concerns with the approach.
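For reference, this is roughly what such overrides look like with uv (the pins below are illustrative, not the exact ones in this PR):

# Hypothetical pyproject.toml excerpt; version pins are illustrative only.
[tool.uv]
override-dependencies = [
    "pydantic>=2.10",  # graphrag needs a newer pydantic than the workspace pin
    "tenacity>=9.0",   # llamaindex (dev-only) pins tenacity lower
]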

lspinheiro avatar Dec 17 '24 06:12 lspinheiro

Thank you! More documentation would help me review this PR. I would like to be able to build the docs page on this PR and see the example.

gagb avatar Dec 17 '24 18:12 gagb

Related #4438

gagb avatar Dec 19 '24 18:12 gagb

Thank you! More documentation would help me review this PR. I would like to be able to build the docs page on this PR and see the example.

@gagb , I added a sample with a readme and some docstrings that should help with the review.

lspinheiro avatar Dec 20 '24 02:12 lspinheiro

Codecov Report

Attention: Patch coverage is 92.61745% with 11 lines in your changes missing coverage. Please review.

Project coverage is 69.40%. Comparing base (8efe0c4) to head (b372551). Report is 1 commit behind head on main.

Files with missing lines                                 Patch %   Lines
...xt/src/autogen_ext/tools/graphrag/_local_search.py    88.88%    6 Missing :warning:
...t/src/autogen_ext/tools/graphrag/_global_search.py    89.58%    5 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4612      +/-   ##
==========================================
+ Coverage   69.07%   69.40%   +0.33%     
==========================================
  Files         159      163       +4     
  Lines       10346    10495     +149     
==========================================
+ Hits         7146     7284     +138     
- Misses       3200     3211      +11     
Flag        Coverage Δ
unittests   69.40% <92.61%> (+0.33%) :arrow_up:

Flags with carried forward coverage won't be shown.


codecov[bot] avatar Jan 03 '25 23:01 codecov[bot]

Let's add some unit tests; see the code coverage result. Is it possible to run a simple setup procedure with a mini data set, perhaps a generated one?

ekzhu avatar Jan 04 '25 08:01 ekzhu

Let's add some unit tests; see the code coverage result. Is it possible to run a simple setup procedure with a mini data set, perhaps a generated one?

How much data do you think it is OK to add? I think the Sherlock Holmes book generates roughly 10 MB of data between the parquet and vector DB files. I can try to look into something smaller, but I don't know how to estimate graphrag's output size from its input, so it's hard to say how much test data I'd need to store in the repo.

lspinheiro avatar Jan 06 '25 06:01 lspinheiro

How much data do you think it is OK to add?

How about a text file with 10 sentences? What is the size of the index?

ekzhu avatar Jan 06 '25 08:01 ekzhu

Let's add some unit tests; see the code coverage result. Is it possible to run a simple setup procedure with a mini data set, perhaps a generated one?

How much data do you think it is OK to add? I think the Sherlock Holmes book generates roughly 10 MB of data between the parquet and vector DB files. I can try to look into something smaller, but I don't know how to estimate graphrag's output size from its input, so it's hard to say how much test data I'd need to store in the repo.

Is there a mirror we can fetch it from instead of including it in the repo?

jackgerrits avatar Jan 06 '25 14:01 jackgerrits

@ekzhu @jackgerrits, I added the data in a conftest file. Since we are mocking the LLM calls, the data size won't matter as much.
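Roughly the shape such a test can take; the engine attribute, method, and fixture names below are hypothetical rather than the actual ones in the PR:

# Hypothetical sketch of a mocked-engine test. `_search_engine`, `asearch`,
# and the `global_tool` fixture (built from the small conftest data) are
# illustrative names, not confirmed API.
from unittest.mock import AsyncMock, Mock

import pytest
from autogen_core import CancellationToken


@pytest.mark.asyncio
async def test_global_search_tool(global_tool):
    # Swap the LLM-backed engine for a mock so no network call is made.
    fake_result = Mock(response="Dr. Becher is mentioned near the station.")
    global_tool._search_engine = AsyncMock()
    global_tool._search_engine.asearch.return_value = fake_result

    result = await global_tool.run_json({"query": "Who is Dr. Becher?"}, CancellationToken())
    assert "Becher" in str(result)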

lspinheiro avatar Jan 07 '25 12:01 lspinheiro

@ekzhu, CI and docstrings are fixed. I also updated all examples to use component config, but didn't create a copy in the graphrag sample to avoid a lot of duplicated code; I added a comment instead, referring to the chainlit template.

I had to add some dependency overrides because of conflicts between chainlit and graphrag; I will create an issue to see whether we can resolve this with better-defined dependency groups. CC @jackgerrits

lspinheiro avatar Jan 11 '25 02:01 lspinheiro

@ekzhu, I have updated the PR based on the feedback.

lspinheiro avatar Jan 13 '25 22:01 lspinheiro