[Bug]: Pipeline Fails with --emit json Option: Unable to Find create_base_text_units.parquet Despite create_base_text_units.json Existing
Do you need to file an issue?
- [X] I have searched the existing issues and this bug is not already filed.
- [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
When indexing with the command poetry run poe index --verbose --emit json (i.e., setting the emit format to json), the pipeline fails right after the create_base_text_units workflow. The error message says it cannot find create_base_text_units.parquet, even though create_base_text_units.json exists in the output directory.
Steps to reproduce
- After initializing, run:
poetry run poe index --verbose --emit json
- The pipeline outputs:
⠏ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
└── create_base_text_units
❌ Errors occurred during the pipeline run, see logs for more details.
- Check the logs for details:
{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 320, in run_pipeline\n await inject_workflow_data_dependencies(workflow)\n File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 256, in inject_workflow_data_dependencies\n table = await load_table_from_storage(f\"{id}.parquet\")\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 242, in load_table_from_storage\n raise ValueError(msg)\nValueError: Could not find create_base_text_units.parquet in storage!\n", "source": "Could not find create_base_text_units.parquet in storage!", "details": null}
Expected Behavior
The pipeline should recognize and use the create_base_text_units.json file in the output directory when --emit json is specified.
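For illustration only, here is a minimal sketch (not the project's actual fix) of what an emit-aware lookup could look like: fall back to the JSON artifact when the Parquet file is absent. It assumes the storage object and the existing load_table_from_storage helper that run.py already uses, and that the JSON emitter writes one record per line; the helper name load_table_for_workflow is hypothetical.

import json
import pandas as pd

async def load_table_for_workflow(workflow_id: str) -> pd.DataFrame:
    # Prefer the Parquet artifact, but fall back to the JSON artifact when
    # only --emit json output exists in storage (assumed storage API).
    parquet_name = f"{workflow_id}.parquet"
    json_name = f"{workflow_id}.json"
    if not await storage.has(parquet_name) and await storage.has(json_name):
        content = await storage.get(json_name, encoding="utf-8")
        rows = [json.loads(line) for line in content.splitlines() if line.strip()]
        return pd.DataFrame(rows)
    return await load_table_from_storage(parquet_name)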
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini # TODO: Change
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150000 # set a leaky bucket throttle TODO: Change
  requests_per_minute: 10000 # set a leaky bucket throttle TODO: Change
  max_retries: 10 # TODO: Change
  max_retry_wait: 10.0 # TODO: Change
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate
parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing
async_mode: threaded # or asyncio
embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # batch_size: 16 # the number of documents to send in a single request
  # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  # target: required # or optional
chunks:
  size: 600 # TODO: Change
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"
cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>
entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000
cluster_graph:
  max_cluster_size: 10
embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832
umap:
  enabled: false # if true, will generate UMAP embeddings for nodes
snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false
local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Logs and screenshots
Console:
Logs:
indexing-engine.log
20:15:01,324 asyncio DEBUG Using proactor: IocpProactor
20:15:01,336 graphrag.config.read_dotenv INFO Loading pipeline .env file
20:15:01,340 graphrag.index.cli INFO using default configuration: {
"llm": {
"api_key": "REDACTED, length 56",
"type": "openai_chat",
"model": "gpt-4o-mini",
"max_tokens": 4000,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 150000,
"requests_per_minute": 10000,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"root_dir": ".",
"reporting": {
"type": "file",
"base_dir": "output/${timestamp}/reports",
"storage_account_blob_url": null
},
"storage": {
"type": "file",
"base_dir": "output/${timestamp}/artifacts",
"storage_account_blob_url": null
},
"cache": {
"type": "file",
"base_dir": "cache",
"storage_account_blob_url": null
},
"input": {
"type": "file",
"file_type": "text",
"base_dir": "input",
"storage_account_blob_url": null,
"encoding": "utf-8",
"file_pattern": ".*\\.txt$",
"file_filter": null,
"source_column": null,
"timestamp_column": null,
"timestamp_format": null,
"text_column": "text",
"title_column": null,
"document_attribute_columns": []
},
"embed_graph": {
"enabled": false,
"num_walks": 10,
"walk_length": 40,
"window_size": 2,
"iterations": 3,
"random_seed": 597832,
"strategy": null
},
"embeddings": {
"llm": {
"api_key": "REDACTED, length 56",
"type": "openai_embedding",
"model": "text-embedding-3-small",
"max_tokens": 4000,
"temperature": 0,
"top_p": 1,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": null,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"batch_size": 16,
"batch_max_tokens": 8191,
"target": "required",
"skip": [],
"vector_store": null,
"strategy": null
},
"chunks": {
"size": 600,
"overlap": 100,
"group_by_columns": [
"id"
],
"strategy": null,
"encoding_model": null
},
"snapshots": {
"graphml": false,
"raw_entities": false,
"top_level_nodes": false
},
"entity_extraction": {
"llm": {
"api_key": "REDACTED, length 56",
"type": "openai_chat",
"model": "gpt-4o-mini",
"max_tokens": 4000,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 150000,
"requests_per_minute": 10000,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/entity_extraction.txt",
"entity_types": [
"organization",
"person",
"geo",
"event"
],
"max_gleanings": 1,
"strategy": null,
"encoding_model": null
},
"summarize_descriptions": {
"llm": {
"api_key": "REDACTED, length 56",
"type": "openai_chat",
"model": "gpt-4o-mini",
"max_tokens": 4000,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 150000,
"requests_per_minute": 10000,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/summarize_descriptions.txt",
"max_length": 500,
"strategy": null
},
"community_reports": {
"llm": {
"api_key": "REDACTED, length 56",
"type": "openai_chat",
"model": "gpt-4o-mini",
"max_tokens": 4000,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 150000,
"requests_per_minute": 10000,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"prompt": "prompts/community_report.txt",
"max_length": 2000,
"max_input_length": 8000,
"strategy": null
},
"claim_extraction": {
"llm": {
"api_key": "REDACTED, length 56",
"type": "openai_chat",
"model": "gpt-4o-mini",
"max_tokens": 4000,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 150000,
"requests_per_minute": 10000,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
},
"parallelization": {
"stagger": 0.3,
"num_threads": 50
},
"async_mode": "threaded",
"enabled": false,
"prompt": "prompts/claim_extraction.txt",
"description": "Any claims or facts that could be relevant to information discovery.",
"max_gleanings": 1,
"strategy": null,
"encoding_model": null
},
"cluster_graph": {
"max_cluster_size": 10,
"strategy": null
},
"umap": {
"enabled": false
},
"local_search": {
"text_unit_prop": 0.5,
"community_prop": 0.1,
"conversation_history_max_turns": 5,
"top_k_entities": 10,
"top_k_relationships": 10,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"max_tokens": 12000,
"llm_max_tokens": 2000
},
"global_search": {
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"max_tokens": 12000,
"data_max_tokens": 12000,
"map_max_tokens": 1000,
"reduce_max_tokens": 2000,
"concurrency": 32
},
"encoding_model": "cl100k_base",
"skip_workflows": []
}
20:15:01,364 graphrag.index.create_pipeline_config INFO Using LLM Config {
"api_key": "*****",
"type": "openai_chat",
"model": "gpt-4o-mini",
"max_tokens": 4000,
"temperature": 0.0,
"top_p": 1.0,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": true,
"tokens_per_minute": 150000,
"requests_per_minute": 10000,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
}
20:15:01,364 graphrag.index.create_pipeline_config INFO Using Embeddings Config {
"api_key": "*****",
"type": "openai_embedding",
"model": "text-embedding-3-small",
"max_tokens": 4000,
"temperature": 0,
"top_p": 1,
"n": 1,
"request_timeout": 180.0,
"api_base": null,
"api_version": null,
"organization": null,
"proxy": null,
"cognitive_services_endpoint": null,
"deployment_name": null,
"model_supports_json": null,
"tokens_per_minute": 0,
"requests_per_minute": 0,
"max_retries": 10,
"max_retry_wait": 10.0,
"sleep_on_rate_limit_recommendation": true,
"concurrent_requests": 25
}
20:15:01,366 graphrag.index.create_pipeline_config INFO skipping workflows
20:15:01,457 graphrag.index.run INFO Running pipeline
20:15:01,457 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output\20240808-201501\artifacts
20:15:01,458 graphrag.index.input.load_input INFO loading input from root_dir=input
20:15:01,459 graphrag.index.input.load_input INFO using file storage for input
20:15:01,460 graphrag.index.storage.file_pipeline_storage INFO search input for files matching .*\.txt$
20:15:01,461 graphrag.index.input.text INFO found text files from input, found [('��ҽ���ѧ���ﲿ����������.txt', {})]
20:15:01,464 graphrag.index.input.text INFO Found 1 files, loading 1
20:15:01,466 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
20:15:01,466 graphrag.index.run INFO Final # of rows loaded: 1
20:15:01,618 graphrag.index.run INFO Running workflow: create_base_text_units...
20:15:01,618 graphrag.index.run INFO dependencies for create_base_text_units: []
20:15:01,622 datashaper.workflow.workflow INFO executing verb orderby
20:15:01,627 datashaper.workflow.workflow INFO executing verb zip
20:15:01,631 datashaper.workflow.workflow INFO executing verb aggregate_override
20:15:01,642 datashaper.workflow.workflow INFO executing verb chunk
20:15:01,824 datashaper.workflow.workflow INFO executing verb select
20:15:01,830 datashaper.workflow.workflow INFO executing verb unroll
20:15:01,838 datashaper.workflow.workflow INFO executing verb rename
20:15:01,842 datashaper.workflow.workflow INFO executing verb genid
20:15:01,848 datashaper.workflow.workflow INFO executing verb unzip
20:15:01,853 datashaper.workflow.workflow INFO executing verb copy
20:15:01,857 datashaper.workflow.workflow INFO executing verb filter
20:15:01,888 graphrag.index.run DEBUG first row of create_base_text_units => {"id":"624d2f0eb938fbf37a6e0b818f91e50a","chunk":"\u4e8c\u3001\u671b\u9762\u8272\n\u671b\u9762\u8272\uff0c\u662f\u533b\u751f\u89c2\u5bdf\u60a3\u8005\u9762\u90e8\u989c\u8272\u4e0e\u5149\u6cfd\u3002\u989c\u8272\u5c31\u662f\u8272\u8c03\u53d8\u5316,\u5149\u6cfd\u5219\u662f\u660e\u5ea6\u53d8\u5316\u3002\u53e4 \u4eba\u628a\u989c\u8272\u5206\u4e3a\u4e94\u79cd\uff0c\u5373\u9752\u3001\u8d64\u3001\u9ec4\u3001\u767d\u3001\u9ed1,\u79f0\u4e3a\u4e94\u8272\u8bca\u3002\u4e94\u8272\u7684\u53d8\u5316\uff0c\u4ee5\u9762\u90e8\u8868\u73b0\u6700\u4e3a\u660e\u663e\u3002 \u56e0\u6b64\uff0c\u672c\u4e66\u4ee5\u671b\u9762\u8272\u6765\u9610\u8ff0\u4e94\u8272\u8bca\u7684\u5185\u5bb9\u3002\n\u636e\u9634\u9633\u4e94\u884c\u548c\u810f\u8c61\u5b66\u8bf4\u7684\u7406\u8bba,\u4e94\u810f\u5e94\u4e94\u8272\u662f:\u9752\u5e94\u809d,\u8d64\u5e94\u5fc3,\u9ec4\u5e94\u813e\uff0c\u767d\u5e94\u80ba,\u9ed1\u5e94\u80be\u3002\n\uff08-\uff09\u9762\u90e8\u4e0e\u810f\u8151\u76f8\u5173\u90e8\u4f4d\n\u9762\u90e8\u7684\u5404\u90e8\u4f4d\u5206\u5c5e\u810f\u8151,\u662f\u9762\u90e8\u671b\u8bca\u7684\u57fa\u7840\u3002\u8272\u4e0e\u90e8\u4f4d\u7ed3\u5408\u8d77\u6765\uff0c\u66f4\u80fd\u8fdb\u4e00\u6b65\u4e86\u89e3\u75c5\u60c5\u3002\n\u9762\u90e8\u5206\u810f\u8151\u90e8\u4f4d:\u6839\u636e\u300a\u7075\u67a2\u2022\u4e94\u8272\u300b\u7684\u5206\u6cd5\uff0c\u628a\u6574\u4e2a\u9762\u90e8\u7684\u540d\u79f0\u5206\u4e3a\uff1a\u9f3b\u2014\u2014\u660e\u5802\uff0c\u7709 \u95f4\u4e00\u9619\uff0c\u989d\u2014\u2014\u5ead\uff08\u989c\uff09\uff0c\u988a\u4fa7\u2014\u2014\u85e9\uff0c\u8033\u95e8\u2014\u2014\u853d\n\u6309\u7167\u4e0a\u8ff0\u540d\u79f0\u548c\u4e94\u810f\u76f8\u5173\u7684\u4f4d\u7f6e\u662f\uff1a\u5ead\u2014\u2014\u9996\u9762\uff0c\u9619\u4e0a\u2014\u2014\u54bd\u5589\uff0c\u9619\u4e2d\uff08\u5370\u5802\uff09\u2014\u2014\u80ba\uff0c \u9619\u4e0b\uff08\u4e0b\u6781\uff0c\u5c71\u6839\uff09 0,\u4e0b\u6781\u4e4b\u4e0b\uff08\u5e74\u5bff\uff09\u2014\u2014\u809d\uff0c\u809d\u90e8\u5de6\u53f3\u2014\u2014\u80c6\uff0c\u809d\u4e0b\uff08\u51c6\u5934\uff09\u4e00\u813e\uff0c \u65b9\u4e0a\uff08\u813e\u4e24\u65c1\uff09\u2014\u2014\u80c3\uff0c\u4e2d\u592e\uff08\u989d\u4e0b\uff09\u2014\u2014\u5927\u80a0\uff0c\u631f\u5927\u80a0\u2014\u2014\u80be\uff0c\u660e\u5802\uff08\u9f3b\u7aef\uff09\u4ee5\u4e0a\u2014\u2014\u5c0f\u80a0\uff0c\u660e \u5802\u4ee5\u4e0b\u2014\u2014\u8180\u80f1\u5b50\u5904\uff08\u56fe2-2\uff09\u3002\n\u53e6\u5916,\u300a\u7d20\u95ee\u2022\u523a\u70ed\u7bc7\u300b\u628a\u4e94\u810f\u4e0e\u9762\u90e8\u76f8\u5173\u90e8\u4f4d\uff0c\u5212\u5206\u4e3a\uff1a\n\u5de6\u988a\u2014\u2014\u809d,\u53f3\u988a\u2014\u2014\u80ba\uff0c\u989d\u2014\u2014\u5fc3,\u987b\u2014\u2014\u80be,\u9f3b\u2014\u2014\u813e\u3002\n\u4ee5\u4e0a\u4e24\u79cd\u65b9\u6cd5\uff0c\u539f\u5219\u4e0a\u4ee5\u524d\u4e00\u79cd\u4e3a\u4e3b\u8981\u4f9d\u636e,\u540e\u4e00\u79cd\u53ef\u4f5c\u4e34\u5e8a\u53c2\u8003\u3002\n\uff08\u56db\uff09\u5e38","chunk_id":"624d2f0eb938fbf37a6e0b818f91e50a","document_ids":["e0bd1fc8d7cf72e91cf530c38e315d74"],"n_tokens":600}
20:15:01,888 graphrag.index.emit.json_table_emitter INFO emitting JSON table create_base_text_units.json
20:15:02,82 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
20:15:02,83 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
20:15:02,83 graphrag.index.run ERROR error running workflow create_base_extracted_entities
Traceback (most recent call last):
File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 320, in run_pipeline
await inject_workflow_data_dependencies(workflow)
File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 256, in inject_workflow_data_dependencies
table = await load_table_from_storage(f"{id}.parquet")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 242, in load_table_from_storage
raise ValueError(msg)
ValueError: Could not find create_base_text_units.parquet in storage!
20:15:02,84 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
logs.json
{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 320, in run_pipeline\n await inject_workflow_data_dependencies(workflow)\n File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 256, in inject_workflow_data_dependencies\n table = await load_table_from_storage(f\"{id}.parquet\")\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 242, in load_table_from_storage\n raise ValueError(msg)\nValueError: Could not find create_base_text_units.parquet in storage!\n", "source": "Could not find create_base_text_units.parquet in storage!", "details": null}
Additional Information
- GraphRAG Version: 0.2.0
- Operating System: Windows 11
- Python Version: 3.11
- Related Issues:
Following this. Has it been resolved yet?
Update to Improve Indexing Process
The following changes in the run.py file will help complete the indexing process:
- Modified the load_table_from_storage function to handle JSON files:

  async def load_table_from_storage(name: str) -> pd.DataFrame:
      if not await storage.has(name):
          msg = f"Could not find {name} in storage!"
          raise ValueError(msg)
      try:
          log.info("read table from storage: %s", name)
          # Read JSON data instead of Parquet
          content = await storage.get(name, encoding='utf-8')
          json_data = [json.loads(line) for line in content.splitlines() if line.strip()]
          return pd.DataFrame(json_data)
      except Exception:
          log.exception("error loading table from storage: %s", name)
          raise

- Updated the inject_workflow_data_dependencies function to use JSON files:

  async def inject_workflow_data_dependencies(workflow: Workflow) -> None:
      workflow.add_table(DEFAULT_INPUT_NAME, dataset)
      deps = workflow_dependencies[workflow.name]
      log.info("dependencies for %s: %s", workflow.name, deps)
      for id in deps:
          workflow_id = f"workflow:{id}"
          # Load JSON file instead of Parquet
          table = await load_table_from_storage(f"{id}.json")
          workflow.add_table(workflow_id, table)
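Note that this swaps one hardcoded extension for another, so the patched pipeline then requires --emit json; the snippets also assume json is imported at the top of run.py (pandas is already used there as pd). As a quick, hedged sanity check that an emitted artifact parses the way the patched loader expects (the timestamped path is an example taken from the log above, and the one-record-per-line layout is an assumption based on the emitter's output):

import json
from pathlib import Path

import pandas as pd

# Example path from the indexing log above; substitute your own run's timestamp.
artifact = Path("output/20240808-201501/artifacts/create_base_text_units.json")

# Parse one JSON record per line (JSON Lines), exactly as the patched loader does.
rows = [json.loads(line) for line in artifact.read_text(encoding="utf-8").splitlines() if line.strip()]
df = pd.DataFrame(rows)
print(len(df), "text units, columns:", list(df.columns))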
The run_global_search and run_local_search functions still need to be updated to remove the .parquet hardcoding for the query functionality to work.
I will post the updated versions of those functions here once they are done.
Update to run.py to run the querying process when using --emit json
Replace the following functions in the run.py file in the query folder with the implementations below. I tested this locally using the documentation example and it works fine.
run_local_search
def run_local_search(
    data_dir: str | None,
    root_dir: str | None,
    community_level: int,
    response_type: str,
    query: str,
):
    """Run a local search with the given query."""
    data_dir, root_dir, config = _configure_paths_and_settings(data_dir, root_dir)
    data_path = Path(data_dir)

    def read_json_file(file_path):
        with open(file_path, 'r') as f:
            return pd.DataFrame([json.loads(line) for line in f if line.strip()])

    final_nodes = read_json_file(data_path / "create_final_nodes.json")
    final_community_reports = read_json_file(data_path / "create_final_community_reports.json")
    final_text_units = read_json_file(data_path / "create_final_text_units.json")
    final_relationships = read_json_file(data_path / "create_final_relationships.json")
    final_entities = read_json_file(data_path / "create_final_entities.json")
    final_covariates_path = data_path / "create_final_covariates.json"
    final_covariates = read_json_file(final_covariates_path) if final_covariates_path.exists() else None

    vector_store_args = config.embeddings.vector_store if config.embeddings.vector_store else {}
    vector_store_type = vector_store_args.get("type", VectorStoreType.LanceDB)
    description_embedding_store = __get_embedding_description_store(
        vector_store_type=vector_store_type,
        config_args=vector_store_args,
    )

    entities = read_indexer_entities(final_nodes, final_entities, community_level)
    store_entity_semantic_embeddings(
        entities=entities, vectorstore=description_embedding_store
    )
    covariates = read_indexer_covariates(final_covariates) if final_covariates is not None else []

    search_engine = get_local_search_engine(
        config,
        reports=read_indexer_reports(
            final_community_reports, final_nodes, community_level
        ),
        text_units=read_indexer_text_units(final_text_units),
        entities=entities,
        relationships=read_indexer_relationships(final_relationships),
        covariates={"claims": covariates},
        description_embedding_store=description_embedding_store,
        response_type=response_type,
    )

    result = search_engine.search(query=query)
    reporter.success(f"Local Search Response: {result.response}")
    return result.response
run_global_search
def run_global_search(
    data_dir: str | None,
    root_dir: str | None,
    community_level: int,
    response_type: str,
    query: str,
):
    """Run a global search with the given query."""
    data_dir, root_dir, config = _configure_paths_and_settings(data_dir, root_dir)
    data_path = Path(data_dir)

    def read_json_file(file_path):
        with open(file_path, 'r') as f:
            return pd.DataFrame([json.loads(line) for line in f if line.strip()])

    final_nodes: pd.DataFrame = read_json_file(data_path / "create_final_nodes.json")
    final_entities: pd.DataFrame = read_json_file(data_path / "create_final_entities.json")
    final_community_reports: pd.DataFrame = read_json_file(data_path / "create_final_community_reports.json")

    reports = read_indexer_reports(
        final_community_reports, final_nodes, community_level
    )
    entities = read_indexer_entities(final_nodes, final_entities, community_level)
    search_engine = get_global_search_engine(
        config,
        reports=reports,
        entities=entities,
        response_type=response_type,
    )

    result = search_engine.search(query=query)
    reporter.success(f"Global Search Response: {result.response}")
    return result.response
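One design note on the read_json_file helper used in both functions: it assumes each artifact is written as JSON Lines (one record per line), matching the "emitting JSON table" behavior seen in the indexing log. pandas can also read that layout directly, so a more compact alternative for the helper would be the hedged sketch below (the lines flag must match how the files were actually emitted):

import pandas as pd

def read_json_file(file_path):
    # lines=True expects one JSON record per line (JSON Lines);
    # drop it if the artifact is ever written as a single JSON array of records.
    return pd.read_json(file_path, orient="records", lines=True)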
The original source code hardcoded the emit type in some functions, causing the --emit option to have no effect in these functions. Are the two solutions mentioned above part of an official update? The code I pulled in September doesn't seem to have these changes.
Closing this as we are only emitting parquet now