[Bug]: Can't disable gleanings
Describe the bug
I tried setting `max_gleanings` to 0 in settings.yaml to prevent the gleaning step from running, but in the indexing-engine.log file `max_gleanings` is still 1, and I can see the gleaning responses in the cache folder.
Steps to reproduce
- Run `python -m graphrag.index --init --root ./ragtest`
- Change the `max_gleanings` value under `entity_extraction` to 0, as shown in the excerpt below (I also tried `None` and `null`; neither worked)
- Run `python -m graphrag.index --root ./ragtest`
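For reference, the relevant part of settings.yaml after step 2 (the full config is included further down):

```yaml
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  max_gleanings: 0  # expecting this to disable the gleaning step
```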
Expected Behavior
In the indexing-engine.log file, `max_gleanings` under `entity_extraction` should be 0.
GraphRAG Config Used
```yaml
encoding_model: cl100k_base
skip_workflows: []

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3-70b-8192
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: https://api.groq.com/openai/v1
  api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 5000 # set a leaky bucket throttle
  requests_per_minute: 30 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 60.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 25 # the number of parallel inflight requests that may be made
  temperature: 0 # temperature for sampling
  top_p: 1 # top-p sampling
  n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${AZURE_API_KEY}
    type: azure_openai_embedding # or openai_embedding
    model: text-embedding-3-small
    api_base: https://graphrag-dataeaze.openai.azure.com/
    api_version: 2024-02-15-preview
    # organization: <organization_id>
    deployment_name: graphrag_test
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: ["Know Your Customer (KYC)", "Credit Cards", "Payment", "Anti-Money Laundering (AML)", "Bank Rate", "Repo Rate", "Debt", "Cash Reserve Ratio", "Commercial Banks", "Wilful Defaulters", "Financial Institutions", "Act", "Law", "Regulation", "Credit Information Companies", "Funds", "Risk Management", "Cooperative Banks", "Dealers", "Stocks", "Monetary Policy", "Inflation Control", "Banking Ombudsman", "Non-Banking Financial Companies (NBFCs)", "Financial Stability", "Basel Norms", "Liquidity Adjustment Facility (LAF)", "Statutory Liquidity Ratio (SLR)", "Priority Sector Lending", "Financial Inclusion", "Digital Payments", "Unified Payments Interface (UPI)", "Real Time Gross Settlement (RTGS)", "National Electronic Funds Transfer (NEFT)", "Deposit Insurance", "Currency Management", "Treasury Bills", "Government Securities", "Microfinance", "Payment Banks"]
  max_gleanings: 0

summarize_descriptions:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_reports:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 1000
  max_input_length: 4000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  llm_temperature: 0 # temperature for sampling
  llm_top_p: 1 # top-p sampling
  llm_n: 1 # Number of completions to generate
  max_tokens: 12000

global_search:
  llm_temperature: 0 # temperature for sampling
  llm_top_p: 1 # top-p sampling
  llm_n: 1 # Number of completions to generate
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```
Logs and screenshots
[Screenshot from indexing-engine.log showing entity_extraction with max_gleanings: 1]
Additional Information
- GraphRAG Version: 0.1.1
- Operating System: Linux
- Python Version: 3.11
- Related Issues:
UPDATE: Setting `max_gleanings` to -1 in settings.yaml seems to have the intended effect.
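This behavior would be consistent with the config loader coalescing falsy values to the default of 1, so 0 is silently replaced while -1 (truthy) passes through. A minimal sketch of that pattern; the helper name and default below are assumptions for illustration, not GraphRAG's actual code:

```python
# Hypothetical sketch of a falsy-default pattern that would explain the bug;
# names and the default value are illustrative, not taken from GraphRAG.

DEFAULT_MAX_GLEANINGS = 1  # assumed library default

def resolve_max_gleanings(configured: int | None) -> int:
    # `or` falls back whenever the configured value is falsy, so an
    # explicit 0 is replaced by the default, while -1 is kept as-is.
    return configured or DEFAULT_MAX_GLEANINGS

print(resolve_max_gleanings(0))     # 1  -> gleaning still runs once
print(resolve_max_gleanings(None))  # 1  -> same fallback
print(resolve_max_gleanings(-1))    # -1 -> effectively disables gleaning
```

If that is indeed the cause, reading the value with an explicit `is None` check instead of `or` would let 0 disable gleaning as expected.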