
[Bug]: `run_extract_entities` fails on long input

Open isaac-pocketfm opened this issue 1 year ago • 5 comments

Do you need to file an issue?

  • [X] I have searched the existing issues and this bug is not already filed.
  • [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When running the create_base_extracted_entities workflow on a large input file, the call to text_splitter.split_text inside run_extract_entities produces a text_list with a different number of elements than docs. The document indices in the extractor's results therefore no longer line up with the docs array, so entities get assigned to the wrong documents, and an IndexError is raised whenever a chunk index exceeds the length of docs.
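
A minimal sketch of the failure mode (the Document class and split_text helper below are stand-ins, not GraphRAG internals; only the docs[int(id)] lookup mirrors the traceback further down):

from dataclasses import dataclass


@dataclass
class Document:
    id: str
    text: str


def split_text(text: str, chunk_size: int = 10) -> list[str]:
    # Stand-in for the default text_splitter: one long document -> several chunks.
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


docs = [Document(id="doc-0", text="a" * 25)]   # 1 document
text_list = split_text(docs[0].text)           # 3 chunks -> lengths diverge

# The extractor labels entities with chunk indices, e.g. source_id = "0,2",
# but the lookup in run_graph_intelligence.py indexes into docs, not text_list.
source_id = "0,2"
try:
    print([docs[int(i)].id for i in source_id.split(",")])
except IndexError as err:
    print("IndexError, as in the log below:", err)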

Steps to reproduce

Run run_pipeline_with_config with the following config:

workflows:
  - name: "create_base_extracted_entities"
    config:
      entity_extract:
        strategy:
          type: graph_intelligence
          llm:
            type: openai_chat
            api_key: !ENV ${OPENAI_API_KEY}
            model: gpt-3.5-turbo
            temperature: 1.0

and with a target text file long enough that it is split into multiple chunks by the default text_splitter.
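
For reference, a rough driver script (this assumes the graphrag 0.x indexing API, where run_pipeline_with_config is an async generator and accepts an in-memory dataset; the file names, column name, and result fields below are assumptions and may differ by version):

import asyncio

import pandas as pd
from graphrag.index import run_pipeline_with_config


async def main() -> None:
    # One long document, long enough for the default splitter to produce
    # multiple chunks ("long_document.txt" and the "text" column are placeholders).
    with open("long_document.txt", encoding="utf-8") as f:
        dataset = pd.DataFrame([{"text": f.read()}])

    async for result in run_pipeline_with_config("pipeline.yml", dataset=dataset):
        # Each yielded item corresponds to one workflow run.
        print(result.workflow, result.errors)


asyncio.run(main())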

Expected Behavior

Entity extraction should complete successfully.

GraphRAG Config Used

workflows:
  - name: "create_base_extracted_entities"
    config:
      entity_extract:
        strategy:
          type: graph_intelligence
          llm:
            type: openai_chat
            api_key: !ENV ${OPENAI_API_KEY}
            model: gpt-3.5-turbo
            temperature: 1.0

Logs and screenshots

  File "/Users/isaac/Library/Caches/pypoetry/virtualenvs/kb-proto-iXOcYfEQ-py3.12/lib/python3.12/site-packages/graphrag/index/verbs/entities/extraction/strategies/graph_intelligence/run_graph_intelligence.py", line 103, in <genexpr>
    docs[int(id)].id for id in node["source_id"].split(",")
    ~~~~^^^^^^^^^
IndexError: list index out of range

Additional Information

  • GraphRAG Version: 3.6
  • Operating System: Mac OS
  • Python Version: 3.12
  • Related Issues:

isaac-pocketfm avatar Oct 01 '24 18:10 isaac-pocketfm

Having a large document split into many text chunks is a very common setup. Can you upload your indexing-engine.log? How big is your input document? Do you know how many text chunks result?

natoverse avatar Oct 01 '24 21:10 natoverse

Apologies, I'm new to the library and don't know how to generate indexing-engine.log. I observed the error on a document that is just long enough to be split into 2 chunks. Adding prechunked: true to the strategy was an effective workaround.
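
Concretely, the strategy block with that workaround looks roughly like this (indentation normalized; verify the key placement against your graphrag version):

workflows:
  - name: "create_base_extracted_entities"
    config:
      entity_extract:
        strategy:
          type: graph_intelligence
          prechunked: true   # skip the internal re-splitting that causes the index mismatch
          llm:
            type: openai_chat
            api_key: !ENV ${OPENAI_API_KEY}
            model: gpt-3.5-turbo
            temperature: 1.0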

isaac-pocketfm avatar Oct 01 '24 22:10 isaac-pocketfm

Figured it out, here's the log from one of the offending documents: indexing-engine.log

isaac-pocketfm avatar Oct 01 '24 22:10 isaac-pocketfm

@natoverse I think you can remove the awaiting_response tag

isaac-pocketfm avatar Oct 04 '24 18:10 isaac-pocketfm

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

github-actions[bot] avatar Oct 12 '24 01:10 github-actions[bot]

This issue has been closed after being marked as stale for five days. Please reopen if needed.

github-actions[bot] avatar Oct 17 '24 01:10 github-actions[bot]