
[Bug]: The overlay-defaults param does not provide default values for workflows using a vector store

Open nievespg1 opened this issue 1 year ago • 0 comments

Describe the bug

When a vector store is specified in the YAML configuration, running an indexing job from the command line with the --overlay-defaults flag fails.

The issue stems from the create_pipeline_config function: it takes the parameters used to generate the workflows and then hydrates the pipeline with the same embeddings.vector_store settings for every workflow.

This is a problem because each workflow requires different title_column and id_column values.
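
As a rough illustration (a hypothetical sketch, not graphrag's actual code; the workflow names and config fields below are invented for the example), the hydration step effectively does this:

from dataclasses import dataclass, replace

@dataclass
class VectorStoreConfig:
    # Mirrors the settings.yaml block; title_column/id_column are the
    # per-workflow values mentioned above (field names are illustrative).
    type: str = "lancedb"
    db_uri: str = "/path/to/vector/db"
    overwrite: bool = True
    title_column: str | None = None
    id_column: str | None = None

# What happens today: one shared config is copied to every embedding workflow.
shared = VectorStoreConfig()
workflows = ["entity_description_embedding", "text_unit_text_embedding"]  # hypothetical names
hydrated = {name: shared for name in workflows}  # same columns everywhere -> invalid column errors

# What each workflow actually needs: its own title/id columns.
expected = {
    "entity_description_embedding": replace(shared, title_column="name", id_column="id"),
    "text_unit_text_embedding": replace(shared, title_column="text", id_column="id"),
}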

Steps to reproduce

  1. Define a vector store in your settings.yaml:
embeddings:
  <other-params>
  ...
  vector_store:
    type: lancedb
    overwrite: true
    db_uri: /path/to/vector/db
  2. Start a new indexing job from the CLI, pointing it at your settings.yaml file and passing the --overlay-defaults flag:
python -m graphrag.index --overlay-defaults --verbose \
    --root <root_dir> \
    --config <root_dir>/settings.yaml \
    --reporter rich \
    --emit parquet

Expected Behavior

You should see an invalid column error as soon as the create_final_entities step starts running. The error will indicate that the title column is not in the Index.
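
For reference, this is the shape of the failure, a minimal pandas sketch assuming the workflow looks up the configured title column in its output frame (the column names are illustrative):

import pandas as pd

# The entities frame produced by the pipeline has no "title" column, so
# selecting it alongside an existing column fails with pandas' "not in
# index" error, matching the message described above.
entities = pd.DataFrame({"name": ["A"], "description": ["an entity"]})
entities[["name", "title"]]  # KeyError: "['title'] not in index"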

GraphRAG Config Used

# Define anchors to be reused
openai_api_key_smt_octo: &openai_api_key_smt_octo ${OPENAI_API_KEY}

#######################
# pipeline parameters # 
#######################

# data inputs
input:
  type: file
  file_type: text
  file_pattern: .*\.txt$
  base_dir: <base/dir/path>

# tokenizer model name
encoding_model: &encoding_name o200k_base # gpt-4o
# encoding_model: &encoding_name cl100k_base # gpt-4-turbo

# text chunking
chunks:
  size: &chunk_size 1000 # 1000 tokens (about 4000 characters)
  overlap: &chunk_overlap 100 # 100 tokens (about 400 characters)
  strategy:
      type: tokens
      chunk_size: *chunk_size
      chunk_overlap: *chunk_overlap
      encoding_name: *encoding_name

# chat llm inputs
llm: &chat_llm
  api_key: *openai_api_key_smt_octo
  type: openai_chat
  model: gpt-4o-mini
  max_tokens: 4096
  request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
  api_version: "2024-02-01"
  # deployment_name: gpt-4o-mini
  model_supports_json: true
  tokens_per_minute: 1000000
  requests_per_minute: 10000
  max_retries: 20
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true
  concurrent_requests: 25

parallelization: &parallelization
  stagger: 0.1
  num_threads: 50

async_mode: &async_mode asyncio
# async_mode: &async_mode threaded

entity_extraction:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  # prompt: &extraction_prompt !include <base/dir/path>/prompts/entity_extraction.txt
  prompt: <base/dir/path>/prompts/entity_extraction.txt
  max_gleanings: 1

summarize_descriptions:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <base/dir/path>/prompts/summarize_descriptions.txt
  max_length: 500

community_reports:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <base/dir/path>/prompts/community_report.txt
  max_length: &max_report_length 2000
  max_input_length: 8000

# embeddings llm inputs
embeddings:
  llm:
    api_key: *openai_api_key_smt_octo
    type: openai_embedding
    model: text-embedding-ada-002
    request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
    api_version: "2024-02-01"
    # deployment_name: text-embedding-ada-002
    model_supports_json: false
    tokens_per_minute: 10000000
    requests_per_minute: 10000
    max_retries: 20
    max_retry_wait: 10
    sleep_on_rate_limit_recommendation: true
    concurrent_requests: 25
  parallelization: *parallelization
  async_mode: *async_mode
  batch_size: 16
  batch_max_tokens: 8191
  vector_store: 
      type: lancedb
      overwrite: true
      db_uri: <base/dir/path>/index/storage/lancedb
  
cache:
  type: file
  base_dir: <base/dir/path>/index/cache

storage:
  type: file
  base_dir: <base/dir/path>/index/storage

reporting:
  type: file
  base_dir: <base/dir/path>/index/reporting

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true

#####################################
# orchestration (query) definitions # 
#####################################
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  llm_max_tokens: 2000

global_search:
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 0.1.1
  • Operating System: 22.04.1-Ubuntu
  • Python Version: 3.11.9
  • Related Issues:

nievespg1 · Jul 24 '24 01:07