[Bug]: The overlay-defaults param does not provide default values for workflows using a vector store
Describe the bug
When a vector store is specified in the YAML configuration, running an indexing job from the command line with the `--overlay-defaults` flag fails.
The issue stems from the `create_pipeline_config` function. This function takes all the parameters needed to generate the workflows and then hydrates the pipeline using the same `embeddings.vector_store` settings for all of them.
This is a problem because each workflow requires different `title_column` and `id_column` values.
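For illustration, here is a minimal sketch of the pattern described above (hypothetical code, not GraphRAG's actual `create_pipeline_config`): a builder that copies one shared `vector_store` block into every workflow, leaving no way to give each workflow its own `title_column`/`id_column`:

```python
from copy import deepcopy

# Hypothetical sketch of the problem, NOT graphrag's create_pipeline_config:
# a single embeddings.vector_store block is hydrated into every workflow.
SHARED_VECTOR_STORE = {
    "type": "lancedb",
    "db_uri": "/path/to/vector/db",
    "title_column": "name",  # one value applied to all workflows...
    "id_column": "id",
}

WORKFLOWS = ["create_final_entities", "create_final_relationships"]


def build_pipeline_config(workflows: list[str]) -> dict:
    # Every workflow gets an identical copy of the shared settings, so a
    # workflow that expects a different title_column fails at runtime
    # with an invalid-column error.
    return {wf: {"vector_store": deepcopy(SHARED_VECTOR_STORE)} for wf in workflows}


print(build_pipeline_config(WORKFLOWS))
```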
Steps to reproduce
- Define a vector store in your `settings.yaml`:

```yaml
embeddings:
  <other-params>
  ...
  vector_store:
    type: lancedb
    overwrite: true
    db_uri: /path/to/vector/db
```
- Start a new indexing job from the command line, using the `settings.yaml` file and the `--overlay-defaults` flag:

```bash
python -m graphrag.index --overlay-defaults --verbose \
  --root <root_dir> \
  --config <root_dir>/settings.yaml \
  --reporter rich \
  --emit parquet
```
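After a failed run, you can optionally inspect what the indexer managed to write to the store. A quick check assuming the `lancedb` Python client (the path is the `db_uri` from the config above):

```python
import lancedb  # pip install lancedb

# List the tables created before the failure and print their schemas;
# per the bug, each one was hydrated from the same vector_store settings.
db = lancedb.connect("/path/to/vector/db")
for name in db.table_names():
    print(name, db.open_table(name).schema)
```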
Expected Behavior
You should see an invalid-column error as soon as the `create_final_entities` step starts running. The error will indicate that the `title` column is not in the index.
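For reference, this class of error comes from selecting a column a DataFrame does not have. An illustrative pandas snippet (not GraphRAG code) that reproduces the same message:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})  # no "title" column

try:
    df[["id", "title"]]  # mix of present and missing labels
except KeyError as err:
    print(err)  # "['title'] not in index" (exact text may vary by pandas version)
```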
GraphRAG Config Used
```yaml
# Define anchors to be reused
openai_api_key_smt_octo: &openai_api_key_smt_octo ${OPENAI_API_KEY}

#######################
# pipeline parameters #
#######################

# data inputs
input:
  type: file
  file_type: text
  file_pattern: .*\.txt$
  base_dir: <base/dir/path>

# tokenizer model name
encoding_model: &encoding_name o200k_base # gpt-4o
# encoding_model: &encoding_name cl100k_base # gpt-4-turbo

# text chunking
chunks:
  size: &chunk_size 1000 # 1000 tokens (about 4000 characters)
  overlap: &chunk_overlap 100 # 100 tokens (about 400 characters)
  strategy:
    type: tokens
    chunk_size: *chunk_size
    chunk_overlap: *chunk_overlap
    encoding_name: *encoding_name

# chat llm inputs
llm: &chat_llm
  api_key: *openai_api_key_smt_octo
  type: openai_chat
  model: gpt-4o-mini
  max_tokens: 4096
  request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
  api_version: "2024-02-01"
  # deployment_name: gpt-4o-mini
  model_supports_json: true
  tokens_per_minute: 1000000
  requests_per_minute: 10000
  max_retries: 20
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true
  concurrent_requests: 25

parallelization: &parallelization
  stagger: 0.1
  num_threads: 50

async_mode: &async_mode asyncio
# async_mode: &async_mode threaded

entity_extraction:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  # prompt: &extraction_prompt !include <base/dir/path>/prompts/entity_extraction.txt
  prompt: <base/dir/path>/prompts/entity_extraction.txt
  max_gleanings: 1

summarize_descriptions:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <base/dir/path>/prompts/summarize_descriptions.txt
  max_length: 500

community_reports:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <base/dir/path>/prompts/community_report.txt
  max_length: &max_report_length 2000
  max_input_length: 8000

# embeddings llm inputs
embeddings:
  llm:
    api_key: *openai_api_key_smt_octo
    type: openai_embedding
    model: text-embedding-ada-002
    request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
    api_version: "2024-02-01"
    # deployment_name: text-embedding-ada-002
    model_supports_json: false
    tokens_per_minute: 10000000
    requests_per_minute: 10000
    max_retries: 20
    max_retry_wait: 10
    sleep_on_rate_limit_recommendation: true
    concurrent_requests: 25
  parallelization: *parallelization
  async_mode: *async_mode
  batch_size: 16
  batch_max_tokens: 8191
  vector_store:
    type: lancedb
    overwrite: true
    db_uri: <base/dir/path>/index/storage/lancedb

cache:
  type: file
  base_dir: <base/dir/path>/index/cache

storage:
  type: file
  base_dir: <base/dir/path>/index/storage

reporting:
  type: file
  base_dir: <base/dir/path>/index/reporting

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true

#####################################
# orchestration (query) definitions #
#####################################
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  llm_max_tokens: 2000

global_search:
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```
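As a quick sanity check on the anchors/aliases in this file (a hypothetical snippet, not part of GraphRAG), the config can be loaded with PyYAML to confirm the aliases resolve:

```python
import yaml  # pip install pyyaml

with open("settings.yaml") as f:
    cfg = yaml.safe_load(f)  # note: ${OPENAI_API_KEY} loads as a literal string here

# The *chat_llm and *parallelization aliases should resolve into each section.
assert cfg["entity_extraction"]["llm"] == cfg["llm"]
assert cfg["embeddings"]["vector_store"]["type"] == "lancedb"
print("vector store settings:", cfg["embeddings"]["vector_store"])
```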
Logs and screenshots
No response
Additional Information
- GraphRAG Version: 0.1.1
- Operating System: Ubuntu 22.04.1
- Python Version: 3.11.9
- Related Issues: