[Bug]: Search fails with path errors when using custom storage and reporting paths
Describe the bug
I am trying to keep all indexed artifacts for my documents in a single folder. I do not want separate index folders per timestamp, because every time I add new documents a new folder is generated, and when I later want to run a query I do not know which index artifacts to point to and load for the response.
As per discussion #354, we can add new documents and rerun the indexer, which will add the new data to the summaries. I therefore modified the storage and reporting paths, which led to the issues below.
I changed the paths from the default base_dir: "output/${timestamp}/artifacts" to:

storage:
  type: file # or blob
  base_dir: "output/files/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/files/reports"
Issues:

1. The indexing engine logs are still generated under a timestamp folder, which leads to the problems below when performing local and global searches.

Case A: Global and local searches fail because the correct path cannot be located or loaded. With storage base_dir: "output/artifacts", the inferred path comes back as output/artifacts/artifacts, which does not exist at all. The path-inference code in question:
def _infer_data_dir(root: str) -> str:
    output = Path(root) / "output"
    # use the latest data-run folder
    if output.exists():
        folders = sorted(output.iterdir(), key=os.path.getmtime, reverse=True)
        if len(folders) > 0:
            folder = folders[0]
            return str((folder / "artifacts").absolute())
    msg = f"Could not infer data directory from root={root}"
    raise ValueError(msg)
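To make case A concrete, here is a minimal standalone sketch of what that inference computes (the "ragtest_demo" layout is created purely for illustration):

import os
from pathlib import Path

# Hypothetical layout for case A: storage base_dir is "output/artifacts",
# so the only entry under <root>/output is the artifacts folder itself.
root = Path("ragtest_demo")
(root / "output" / "artifacts").mkdir(parents=True, exist_ok=True)

output = root / "output"
folders = sorted(output.iterdir(), key=os.path.getmtime, reverse=True)
folder = folders[0]                 # ragtest_demo/output/artifacts
data_dir = folder / "artifacts"     # ragtest_demo/output/artifacts/artifacts
print(data_dir, data_dir.exists())  # prints False -> search cannot load data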
Case B: With storage base_dir: "output/files/artifacts" and multiple runs, indexing-engine.log is still generated under the timestamp folders. The inference logic always sorts the folders under output/ by last-modified time, so the sorted folders are:
- WindowsPath('ragtest/output/20240717-131540')
- WindowsPath('ragtest/output/files')
- WindowsPath('ragtest/output/20240717-131330')
folder = folders[0] then returns ragtest/output/20240717-131540/artifacts, which is invalid (the artifacts actually live under output/files/artifacts).
Result: local and global searches error out due to path issues.
Note: I am running my LLM models locally with LM Studio and Ollama to save costs.
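A possible direction for a fix, offered only as a sketch (the storage_base_dir parameter and the artifacts-folder filter are my suggestions, not existing code): honor the configured storage.base_dir from settings.yaml first, and make the timestamp fallback skip folders that contain no artifacts:

import os
from pathlib import Path

def _infer_data_dir(root: str, storage_base_dir: str | None = None) -> str:
    # Sketch: if settings.yaml configured storage.base_dir, trust it directly
    # (covers both case A and case B, since no guessing happens at all).
    if storage_base_dir:
        candidate = Path(root) / storage_base_dir
        if candidate.exists():
            return str(candidate.absolute())
    output = Path(root) / "output"
    if output.exists():
        # Fallback: only consider run folders that actually contain artifacts,
        # so a log-only timestamp folder (case B) is never selected.
        folders = [f for f in output.iterdir() if (f / "artifacts").exists()]
        folders.sort(key=os.path.getmtime, reverse=True)
        if folders:
            return str((folders[0] / "artifacts").absolute())
    msg = f"Could not infer data directory from root={root}"
    raise ValueError(msg)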
Steps to reproduce

1. Modify the paths from base_dir: "output/${timestamp}/artifacts" to:
   storage:
     type: file # or blob
     base_dir: "output/files/artifacts"
   reporting:
     type: file # or console, blob
     base_dir: "output/files/reports"
2. Run the indexer two or more times.
3. Run a global search (exact commands below).
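For reference, the concrete commands (assuming the standard module entry points; the query invocation is the same one shown in the logs below):

python -m graphrag.index --root ./ragtest
python -m graphrag.index --root ./ragtest    # rerun after adding documents
python -m graphrag.query --root ./ragtest --method global "Who is antariksh"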
Expected Behavior
- All reports and artifacts should be generated under the specified paths.
- Paths should be loaded as modified or mentioned in the settings, whether or not they include timestamps.
- The end goal is to be able to add new documents on top of existing indexed data and avoid re-indexing the entire corpus. If timestamp-based output artifacts are required, how can one query across both the old and the new indexed data? (One idea is sketched below.)
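A workaround I am considering, assuming the query CLI's --data flag (which populates the "data dir" printed in the logs below) can be pointed at a fixed artifacts folder to bypass the inference entirely:

python -m graphrag.query --root ./ragtest --data ./ragtest/output/files/artifacts --method global "Who is antariksh"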
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral
  model_supports_json: true # recommended if this is available for your model.

parallelization:
  stagger: 0.3

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf

chunks:
  size: 400
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"

storage:
  type: file # or blob
  base_dir: "output/files/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/files/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 200

claim_extraction:
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  prompt: "prompts/community_report.txt"
  max_length: 1000
  max_input_length: 3000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
Logs and screenshots
Logs with some additional print statements:

PS D:\WORK\PROJECTS\Python-POC\GraphRAG> python -m graphrag.query --root ./ragtest --method global "Who is antariksh"

** args: None
** data dir is NONE
** folders: [WindowsPath('ragtest/output/20240717-141348'), WindowsPath('ragtest/output/20240717-131540'), WindowsPath('ragtest/output/files'), WindowsPath('ragtest/output/20240717-131330')]
INFO: Reading settings from ragtest\settings.yaml
**data_dir D:\WORK\PROJECTS\Python-POC\GraphRAG\ragtest\output\20240717-141348\artifacts -> invalid path
**root_dir ./ragtest
**config {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"root_dir": "./ragtest",
"reporting": {
"type": "file",
"base_dir": "output/files/reports",
"connection_string": null,
"container_name": null,
"storage_account_blob_url": null
},
"storage": {
"type": "file",
"base_dir": "output/files/artifacts",
"connection_string": null,
"container_name": null,
"storage_account_blob_url": null
},
"cache": {
"type": "file",
"base_dir": "cache",
"connection_string": null,
"container_name": null,
"storage_account_blob_url": null
},
"input": {
"type": "file",
"file_type": "text",
"base_dir": "input",
"connection_string": null,
"storage_account_blob_url": null,
"container_name": null,
"encoding": "utf-8",
"file_pattern": ".*\.txt$",
"file_filter": null,
"source_column": null,
"timestamp_column": null,
"timestamp_format": null,
"text_column": "text",
"title_column": null,
"document_attribute_columns": []
},
"embed_graph": {
"enabled": false,
"num_walks": 10,
"walk_length": 40,
"window_size": 2,
"iterations": 3,
"random_seed": 597832,
"strategy": null
},
"embeddings": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_embedding",
"model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://localhost:1234/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": null,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"batch_size": 16,
"batch_max_tokens": 8191,
"target": "required",
"skip": [],
"vector_store": null,
"strategy": null
},
"chunks": {
"size": 400,
"overlap": 100,
"group_by_columns": [
"id"
],
"strategy": null
},
"snapshots": {
"graphml": false,
"raw_entities": false,
"top_level_nodes": false
},
"entity_extraction": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/entity_extraction.txt",
"entity_types": [
"organization",
"person",
"geo",
"event"
],
"max_gleanings": 0,
"strategy": null
},
"summarize_descriptions": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/summarize_descriptions.txt",
"max_length": 200,
"strategy": null
},
"community_reports": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": null,
"max_length": 2000,
"max_input_length": 8000,
"strategy": null
},
"claim_extraction": {
"llm": {
"api_key": "<API_KEY>",
"type": "openai_chat",
"model": "mistral",
"max_tokens": 4000,
"request_timeout": 180.0,
"api_base": "http://127.0.0.1:11434/v1",
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"enabled": false,
"prompt": "prompts/claim_extraction.txt",
"description": "Any claims or facts that could be relevant to information discovery.",
"max_gleanings": 0,
"strategy": null
},
"cluster_graph": {
"max_cluster_size": 10,
"strategy": null
},
"umap": {
"enabled": false
},
"local_search": {
"text_unit_prop": 0.5,
"community_prop": 0.1,
"conversation_history_max_turns": 5,
"top_k_entities": 10,
"top_k_relationships": 10,
"max_tokens": 12000,
"llm_max_tokens": 2000
},
"global_search": {
"max_tokens": 12000,
"data_max_tokens": 12000,
"map_max_tokens": 1000,
"reduce_max_tokens": 2000,
"concurrency": 32
},
"encoding_model": "cl100k_base",
"skip_workflows": []
}
Traceback (most recent call last):
File "C:\Users\GodSpeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\GodSpeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "d:\WORK\PROJECTS\Python-POC\GraphRAG.venv\lib\site-packages\graphrag\query_main.py", line 84, in
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues: