fix: small-corpus path

Open KartikVashishta opened this issue 6 months ago • 2 comments

Description

Indexing a tiny corpus with CacheType.memory crashed with
ValueError: Columns must be same length as key, followed by dtype errors further along the pipeline.
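For context, here is a minimal sketch of one way pandas produces this message. It assumes the failing step assigns an expanded edges frame to two columns when no edges were extracted; that is a guess at the shape of the bug, not the actual graphrag code, and the frame and column names are illustrative.

```python
import pandas as pd

# Hypothetical frame: on a tiny corpus, no edges were extracted.
df = pd.DataFrame({"edges": []})

# Expanding an empty list yields a frame with zero columns.
expanded = pd.DataFrame(df["edges"].tolist())

try:
    # Two target columns on the left, zero columns on the right.
    df[["source", "target"]] = expanded
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The same mismatch appears whenever the expanded frame has a different number of columns than the key list, for example when a row holds a malformed edge.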

This PR hardens three spots:

  1. build_noun_graph._extract_edges

    • Expand the edges column into source/target columns via pd.DataFrame, padding malformed rows and dropping NaNs so the assignment no longer raises a broadcast error.
  2. graph_to_dataframes

    • Store numpy embeddings as lists so the single-column assignment can't fail to broadcast.
  3. prune_graph

    • a) Return early for empty graphs.
    • b) Call astype(str) on source/target before merges to avoid a float64 vs. object dtype mismatch.
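The three changes can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: expand_edges, attach_embeddings and merge_node_degree are hypothetical helpers, the "embedding", "node" and "degree" column names are made up for the example, and the early return is shown on a DataFrame rather than on the graph object that prune_graph receives.

```python
import numpy as np
import pandas as pd


def expand_edges(df: pd.DataFrame) -> pd.DataFrame:
    """Split an 'edges' column of (source, target) pairs into two columns."""
    pairs = []
    for edge in df["edges"]:
        items = list(edge) if isinstance(edge, (list, tuple)) else []
        # Pad short or malformed rows with None so every row has two slots.
        # (Dropping anything past the second item is an assumption of this sketch.)
        pairs.append((items + [None, None])[:2])
    expanded = pd.DataFrame(pairs, columns=["source", "target"], index=df.index)
    out = df.drop(columns=["edges"]).join(expanded)
    # Rows that needed padding now contain None and are dropped here.
    return out.dropna(subset=["source", "target"])


def attach_embeddings(nodes: pd.DataFrame, vectors: np.ndarray) -> pd.DataFrame:
    """Store one embedding per row as a plain Python list."""
    out = nodes.copy()
    # A Series of lists is one object column, so no 2-D array is broadcast.
    out["embedding"] = pd.Series(vectors.tolist(), index=out.index)
    return out


def merge_node_degree(edges: pd.DataFrame, degrees: pd.DataFrame) -> pd.DataFrame:
    """Left-join a per-node 'degree' onto edges via the 'source' key."""
    if edges.empty:
        # Nothing to join: return early, similar in spirit to the empty-graph guard.
        return edges.copy()
    # Cast keys to str on both sides so float64 and object keys cannot clash.
    left = edges.astype({"source": str, "target": str})
    right = degrees.astype({"node": str})
    return left.merge(right, left_on="source", right_on="node", how="left")


edges = expand_edges(pd.DataFrame({"edges": [("a", "b"), ("c",), None]}))
nodes = attach_embeddings(pd.DataFrame({"title": ["a"]}), np.array([[0.1, 0.2, 0.3]]))
merged = merge_node_degree(edges, pd.DataFrame({"node": ["a"], "degree": [1]}))
print(edges, nodes, merged, sep="\n\n")
```

Building the two columns with an explicit columns= argument means an empty input still yields a frame with source and target columns, which is the case a one-line corpus is most likely to hit.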

Related Issues

Closes #1983

Proposed Changes

  • graphrag/index/operations/build_noun_graph/build_noun_graph.py
  • graphrag/index/operations/graph_to_dataframes.py
  • graphrag/index/operations/prune_graph.py
  • New test tests/unit/indexing/graph/test_small_corpus_bug.py

Checklist

  • [x] Tested locally on a one-line corpus (input/tiny.txt) with IndexingMethod.Fast & CacheType.memory.
  • [x] I have reviewed the code changes.
  • [x] I have updated the documentation (if necessary).
  • [x] I have added appropriate unit tests (if applicable).

Additional Notes

These fixes touch only the small-corpus edge cases; normal pipelines are unaffected. Heads-up: the prompt files were also altered, and those changes can be discarded. Shout if you’d like them dropped or squashed.

KartikVashishta, Jun 25 '25 06:06

@KartikVashishta please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

KartikVashishta, Jun 25 '25 06:06
