[Bug]: Search fails with path errors when using custom storage and reporting paths
Describe the bug
I am trying to keep all indexed artifacts for my documents in a single folder. I do not want separate index folders per timestamp, because every time I add new documents a new folder is generated, and when I later want to run a query I do not know which index artifacts to point to and load for the response.
As per discussion #354, we can add new documents and rerun the indexer, which will add the new data to the summaries. I therefore modified the storage and reporting paths, which led to the issues below.
I changed the paths from the default base_dir: "output/${timestamp}/artifacts" to:

storage:
  type: file # or blob
  base_dir: "output/files/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/files/reports"
Issues:

1. The indexing engine logs are still generated under a timestamp folder, which leads to the problems below when performing local and global searches.

Case A: Global and local searches fail because the correct path cannot be located or loaded. With storage base_dir: "output/artifacts", the inferred path comes back as output/artifacts/artifacts, which does not exist at all. The path-inference code in question:
def _infer_data_dir(root: str) -> str:
    output = Path(root) / "output"
    # use the latest data-run folder
    if output.exists():
        folders = sorted(output.iterdir(), key=os.path.getmtime, reverse=True)
        if len(folders) > 0:
            folder = folders[0]
            return str((folder / "artifacts").absolute())
    msg = f"Could not infer data directory from root={root}"
    raise ValueError(msg)
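To make case A concrete, here is a minimal standalone sketch of what that inference computes (the "ragtest_demo" layout is created purely for illustration):

import os
from pathlib import Path

# Hypothetical layout for case A: storage base_dir is "output/artifacts",
# so the only entry under <root>/output is the artifacts folder itself.
root = Path("ragtest_demo")
(root / "output" / "artifacts").mkdir(parents=True, exist_ok=True)

output = root / "output"
folders = sorted(output.iterdir(), key=os.path.getmtime, reverse=True)
folder = folders[0]                 # ragtest_demo/output/artifacts
data_dir = folder / "artifacts"     # ragtest_demo/output/artifacts/artifacts
print(data_dir, data_dir.exists())  # prints False -> search cannot load data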
Case B: With storage base_dir: "output/files/artifacts" and multiple runs, indexing-engine.log is still generated under the timestamp folders. The inference logic always sorts the folders under output/ by last-modified time, so the sorted folders are:
- WindowsPath('ragtest/output/20240717-131540')
- WindowsPath('ragtest/output/files')
- WindowsPath('ragtest/output/20240717-131330')
folder = folders[0] then returns ragtest/output/20240717-131540/artifacts, which is invalid (the artifacts actually live under output/files/artifacts).
Result: local and global searches error out due to path issues.
Note: I am running my LLM models locally with LM Studio and Ollama to save costs.
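A possible direction for a fix, offered only as a sketch (the storage_base_dir parameter and the artifacts-folder filter are my suggestions, not existing code): honor the configured storage.base_dir from settings.yaml first, and make the timestamp fallback skip folders that contain no artifacts:

import os
from pathlib import Path

def _infer_data_dir(root: str, storage_base_dir: str | None = None) -> str:
    # Sketch: if settings.yaml configured storage.base_dir, trust it directly
    # (covers both case A and case B, since no guessing happens at all).
    if storage_base_dir:
        candidate = Path(root) / storage_base_dir
        if candidate.exists():
            return str(candidate.absolute())
    output = Path(root) / "output"
    if output.exists():
        # Fallback: only consider run folders that actually contain artifacts,
        # so a log-only timestamp folder (case B) is never selected.
        folders = [f for f in output.iterdir() if (f / "artifacts").exists()]
        folders.sort(key=os.path.getmtime, reverse=True)
        if folders:
            return str((folders[0] / "artifacts").absolute())
    msg = f"Could not infer data directory from root={root}"
    raise ValueError(msg)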
Steps to reproduce

1. Modify the paths from base_dir: "output/${timestamp}/artifacts" to:
   storage:
     type: file # or blob
     base_dir: "output/files/artifacts"
   reporting:
     type: file # or console, blob
     base_dir: "output/files/reports"
2. Run the indexer two or more times.
3. Run a global search (exact commands below).
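For reference, the concrete commands (assuming the standard module entry points; the query invocation is the same one shown in the logs below):

python -m graphrag.index --root ./ragtest
python -m graphrag.index --root ./ragtest    # rerun after adding documents
python -m graphrag.query --root ./ragtest --method global "Who is antariksh"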
Expected Behavior
- All reports and artifacts should be generated under the specified paths.
- Paths should be loaded as modified or mentioned in the settings, whether or not they include timestamps.
- The end goal is to be able to add new documents on top of existing indexed data and avoid re-indexing the entire corpus. If timestamp-based output artifacts are required, how can one query across both the old and the new indexed data? (One idea is sketched below.)
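A workaround I am considering, assuming the query CLI's --data flag (which populates the "data dir" printed in the logs below) can be pointed at a fixed artifacts folder to bypass the inference entirely:

python -m graphrag.query --root ./ragtest --data ./ragtest/output/files/artifacts --method global "Who is antariksh"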
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral
  model_supports_json: true # recommended if this is available for your model.

parallelization:
  stagger: 0.3

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf

chunks:
  size: 400
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"

storage:
  type: file # or blob
  base_dir: "output/files/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/files/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 200

claim_extraction:
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  prompt: "prompts/community_report.txt"
  max_length: 1000
  max_input_length: 3000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
Logs and screenshots
Logs with some additional print statements:

PS D:\WORK\PROJECTS\Python-POC\GraphRAG> python -m graphrag.query --root ./ragtest --method global "Who is antariksh"

** args: None
** data dir is NONE
** folders: [WindowsPath('ragtest/output/20240717-141348'), WindowsPath('ragtest/output/20240717-131540'), WindowsPath('ragtest/output/files'), WindowsPath('ragtest/output/20240717-131330')]
INFO: Reading settings from ragtest\settings.yaml
**data_dir D:\WORK\PROJECTS\Python-POC\GraphRAG\ragtest\output\20240717-141348\artifacts -> invalid path
**root_dir ./ragtest
**config {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"root_dir": "./ragtest",
"reporting": {
"type": "file",
"base_dir": "output/files/reports",
"connection_string": null,
"container_name": null,
"storage_account_blob_url": null
},
"storage": {
"type": "file",
"base_dir": "output/files/artifacts",
"connection_string": null,
"container_name": null,
"storage_account_blob_url": null
},
"cache": {
"type": "file",
"base_dir": "cache",
"connection_string": null,
"container_name": null,
"storage_account_blob_url": null
},
"input": {
"type": "file",
"file_type": "text",
"base_dir": "input",
"connection_string": null,
"storage_account_blob_url": null,
"container_name": null,
"encoding": "utf-8",
"file_pattern": ".*\.txt$",
"file_filter": null,
"source_column": null,
"timestamp_column": null,
"timestamp_format": null,
"text_column": "text",
"title_column": null,
"document_attribute_columns": []
},
"embed_graph": {
"enabled": false,
"num_walks": 10,
"walk_length": 40,
"window_size": 2,
"iterations": 3,
"random_seed": 597832,
"strategy": null
},
"embeddings": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_embedding",
"model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://localhost:1234/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": null,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"batch_size": 16,
"batch_max_tokens": 8191,
"target": "required",
"skip": [],
"vector_store": null,
"strategy": null
},
"chunks": {
"size": 400,
"overlap": 100,
"group_by_columns": [
"id"
],
"strategy": null
},
"snapshots": {
"graphml": false,
"raw_entities": false,
"top_level_nodes": false
},
"entity_extraction": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/entity_extraction.txt",
"entity_types": [
"organization",
"person",
"geo",
"event"
],
"max_gleanings": 0,
"strategy": null
},
"summarize_descriptions": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/summarize_descriptions.txt",
"max_length": 200,
"strategy": null
},
"community_reports": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": null,
"max_length": 2000,
"max_input_length": 8000,
"strategy": null
},
"claim_extraction": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"enabled": false,
"prompt": "prompts/claim_extraction.txt",
"description": "Any claims or facts that could be relevant to information discovery.",
"max_gleanings": 0,
"strategy": null
},
"cluster_graph": {
"max_cluster_size": 10,
"strategy": null
},
"umap": {
"enabled": false
},
"local_search": {
"text_unit_prop": 0.5,
"community_prop": 0.1,
"conversation_history_max_turns": 5,
"top_k_entities": 10,
"top_k_relationships": 10,
"max_tokens": 12000,
"llm_max_tokens": 2000
},
"global_search": {
"max_tokens": 12000,
"data_max_tokens": 12000,
"map_max_tokens": 1000,
"reduce_max_tokens": 2000,
"concurrency": 32
},
"encoding_model": "cl100k_base",
"skip_workflows": []
}
Traceback (most recent call last):
File "C:\Users\GodSpeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\GodSpeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "d:\WORK\PROJECTS\Python-POC\GraphRAG.venv\lib\site-packages\graphrag\query_main.py", line 84, in
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues: