fix: small-corpus path
Description
Indexing a tiny corpus with CacheType.memory crashed with
ValueError: Columns must be same length as key (plus follow-on dtype issues).
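For context, pandas raises this error when a multi-column assignment receives a value whose column count differs from the key. The snippet below is a minimal reproduction under that assumption. The frame and its column names are illustrative stand-ins, not the actual graphrag data, and the real crash site may differ.

```python
import pandas as pd

# Hypothetical stand-in for a tiny corpus: no row yields a usable edge pair.
df = pd.DataFrame({"edges": [[], []]})

# Expanding the empty lists produces a frame with two rows and zero columns.
expanded = pd.DataFrame(df["edges"].tolist(), index=df.index)

try:
    # Two key columns cannot line up with a zero-column value.
    df[["source", "target"]] = expanded
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With a larger corpus at least one row usually carries a full pair, so the expanded frame has two columns and the assignment succeeds. That would explain why only small inputs hit the crash.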
This PR hardens three spots:
- build_noun_graph._extract_edges
  - Expand the edges column into source/target via pd.DataFrame, padding bad rows and dropping NaNs, so the assignment can no longer raise a broadcast error.
- graph_to_dataframes
  - Store numpy embeddings as list so the single-column assignment can’t broadcast-fail.
- prune_graph
  - a) Return early for empty graphs.
  - b) Call astype(str) on source/target before merges to avoid a float64 vs object mismatch.
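The three hardening patterns above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (expand_edges, attach_embeddings, prune_edges), the id and title columns, and the use of plain DataFrames in place of a graph object are all hypothetical.

```python
import numpy as np
import pandas as pd


def expand_edges(df: pd.DataFrame) -> pd.DataFrame:
    """Split an 'edges' column of pairs into 'source' and 'target' columns."""
    # Pad or truncate every row to exactly two slots so the expanded frame
    # always has two columns; anything that is not a list/tuple counts as bad.
    pairs = [
        (list(e)[:2] + [None, None])[:2] if isinstance(e, (list, tuple)) else [None, None]
        for e in df["edges"]
    ]
    expanded = pd.DataFrame(pairs, columns=["source", "target"], index=df.index)
    out = pd.concat([df.drop(columns=["edges"]), expanded], axis=1)
    # Rows without a complete pair carry no edge, so drop them.
    return out.dropna(subset=["source", "target"]).reset_index(drop=True)


def attach_embeddings(nodes: pd.DataFrame, embeddings: dict) -> pd.DataFrame:
    """Add an 'embedding' column holding one plain Python list per row."""
    out = nodes.copy()
    # Converting each vector to a list keeps the value one-dimensional:
    # one object per row instead of a 2D array.
    out["embedding"] = [np.asarray(embeddings[i]).tolist() for i in out["id"]]
    return out


def prune_edges(edges: pd.DataFrame, keep_nodes: pd.DataFrame) -> pd.DataFrame:
    """Keep only edges whose source appears in keep_nodes['title']."""
    if edges.empty or keep_nodes.empty:
        # Nothing to join against: return an empty frame with the same columns.
        return edges.iloc[0:0].copy()
    # Cast the join keys to str on both sides so the merge never compares
    # a float64 column with an object column.
    edges = edges.astype({"source": str, "target": str})
    keep = keep_nodes.astype({"title": str})
    merged = edges.merge(keep[["title"]], left_on="source", right_on="title", how="inner")
    return merged.drop(columns=["title"])


raw = pd.DataFrame({"edges": [["a", "b"], [], ["c"], None], "weight": [1, 2, 3, 4]})
edges = expand_edges(raw)
nodes = attach_embeddings(pd.DataFrame({"id": ["a"]}), {"a": np.array([0.1, 0.2])})
pruned = prune_edges(edges, pd.DataFrame({"title": ["a"]}))
```

In this sketch only the first row of raw survives expansion, because the other three rows lack a complete source/target pair.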
Related Issues
Closes #1983
Proposed Changes
- graphrag/index/operations/build_noun_graph/build_noun_graph.py
- graphrag/index/operations/graph_to_dataframes.py
- graphrag/index/operations/prune_graph.py
- New test tests/unit/indexing/graph/test_small_corpus_bug.py
Checklist
- [x] Tested locally on a one-line corpus (input/tiny.txt) with IndexingMethod.Fast & CacheType.memory.
- [x] I have reviewed the code changes.
- [x] I have updated the documentation (if necessary).
- [x] I have added appropriate unit tests (if applicable).
Additional Notes
These fixes touch only the small-corpus edge cases; normal pipelines are unaffected. Heads-up: the prompt files were also altered, and those changes can be discarded. Shout if you’d like them dropped or squashed.
@KartikVashishta please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:
- (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
  @microsoft-github-policy-service agree
- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
  @microsoft-github-policy-service agree company="Microsoft"
@microsoft-github-policy-service agree