[Bug]: empty workflows list, no indexing is done
Describe the bug
Running in Google Colab. I tried several different settings.yaml files, including the stock one with a .env file. Once, starting from scratch in a new folder, it partly worked (it errored out before all workflow tasks were done), but after that the problem persists. I can see no pattern for the cause. Please see indexing-engine.log.
Steps to reproduce
- Use Google Colab to run.
- `pip install graphrag`.

  Note: pip reports the following dependency conflicts (a version-check sketch follows this list):

  ```
  ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
  cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 2.2.2 which is incompatible.
  cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 15.0.0 which is incompatible.
  google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.2 which is incompatible.
  ```

- Run indexing with several different settings.yaml files, using combinations of .env variables and config entered directly in the settings file, including the stock settings.yaml.
- Observe an empty artifacts folder; no workflow tasks are run.
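For the dependency warnings above, a quick way to record which versions actually ended up installed (a sketch; whether these conflicts are related to the indexing bug is unverified):

```bash
# Print the resolved versions that triggered the resolver warnings.
pip show pandas pyarrow graphrag | grep -E '^(Name|Version):'
```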
Expected Behavior
The workflow list should be fully populated and all tasks should run correctly. At best I have had a few partial runs; now nothing runs at all.
GraphRAG Config Used
```yaml
encoding_model: cl100k_base
# encoding_model: ${GRAPHRAG_ENCODING_MODEL}
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  # model: gpt-4-turbo-preview
  model: ${GRAPHRAG_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 5 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    # model: text-embedding-3-small
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    # api_base: ${GRAPHRAG_API_BASE}
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
```
Logs and screenshots
indexing-engine.log
```
14:57:04,749 graphrag.index.run INFO Running pipeline with config settings.yaml
14:57:04,751 graphrag.config.read_dotenv INFO Loading pipeline .env file
14:57:05,473 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output/20240710-145704/artifacts
14:57:05,482 graphrag.index.input.load_input INFO loading input from root_dir=input
14:57:05,482 graphrag.index.input.load_input INFO using file storage for input
14:57:05,486 graphrag.index.storage.file_pipeline_storage INFO search /content/drive/MyDrive/.../2024-07-10/input for files matching .*\.txt$
14:57:05,488 graphrag.index.input.text INFO found text files from input, found [('wildfly_jira_compact_3.txt', {}), ('wildfly_jira_compact_2.txt', {}), ('wildfly_jira_compact_1.txt', {})]
14:57:05,504 graphrag.index.workflows.load INFO Workflow Run Order: []
14:57:05,505 graphrag.index.run INFO Final # of rows loaded: 3
```
Additional Information
- GraphRAG Version: 0.1.1
- Operating System: Ubuntu 22.04
- Python Version: 3.10.12
- Related Issues:
I removed the cache, double-checked .env, and tried the following minimal settings.yaml, and still got the same error.
```yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: meta-llama/Llama-3-8b-chat-hf
  api_base: https://api.together.xyz/v1

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: togethercomputer/m2-bert-80M-2k-retrieval
    api_base: https://api.together.xyz/v1

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id]

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0
```
Are you trying to run indexing using the command line interface, i.e. `python -m graphrag.index ...`?
I added a change last week that should address your problem. A new command line flag, `--overlay-defaults`, will be available; it inherits default values (i.e. the workflow steps that are missing from your yaml) in addition to the custom values your config declares.
You can either build the python package from source (run `poetry build` from the root directory of this repo and re-install the wheel) or wait until the next release to start using this new feature.
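For reference, the build-from-source route described above might look like this (a sketch; the clone location and wheel filename are assumptions):

```bash
# Build a wheel from the repo root and reinstall it over the pip release.
git clone https://github.com/microsoft/graphrag.git
cd graphrag
poetry build                                # writes a wheel into dist/
pip install --force-reinstall dist/graphrag-*.whl
```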
Right, I should have specified that.

```
python -m graphrag.index --config <some-settings.yaml> --root .
```
To be clear, I tried multiple settings.yaml files, including ones that specified execution of all work units. All resulted in no workflow steps.
I'm installing from the main branch and can use `--overlay-defaults`:

```
pip install git+https://github.com/microsoft/graphrag@main
```
But settings are still being ignored. `--overlay-defaults` seems to act as a band-aid for some settings. For example, when I add
```yaml
embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: true # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true
```
the `embed_graph`, `graphml`, `raw_entities`, `umap`, and `top_level_nodes` artifacts are not being generated.
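A quick way to check which artifacts a run actually produced against the snapshot settings above (a sketch; assumes the timestamped output layout from the storage config):

```bash
# List everything the indexing run wrote, per timestamped output folder.
ls -R output/*/artifacts
```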
Additionally, when I try a local search there appears to be a missing lancedb dataset; see the first line below. As for the last line, I wonder if that is an issue with running in Colab, and maybe a separate issue.
```
[2024-07-12T14:44:16Z WARN lance::dataset] No existing dataset at /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance, it will be created
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/__main__.py", line 76, in <module>
    run_local_search(
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/cli.py", line 132, in run_local_search
    store_entity_semantic_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/input/loaders/dfs.py", line 91, in store_entity_semantic_embeddings
    vectorstore.load_documents(documents=documents)
  File "/usr/local/lib/python3.10/dist-packages/graphrag/vector_stores/lancedb.py", line 55, in load_documents
    self.document_collection = self.db_connection.create_table(
  File "/usr/local/lib/python3.10/dist-packages/lancedb/db.py", line 418, in create_table
    tbl = LanceTable.create(
  File "/usr/local/lib/python3.10/dist-packages/lancedb/table.py", line 1545, in create
    lance.write_dataset(empty, tbl._dataset_uri, schema=schema, mode=mode)
  File "/usr/local/lib/python3.10/dist-packages/lance/dataset.py", line 2506, in write_dataset
    inner_ds = _write_dataset(reader, uri, params)
OSError: LanceError(IO): Generic LocalFileSystem error: Unable to copy file from /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance/_versions/.tmp_1.manifest_add4893a-5209-4899-81ae-c25465719626 to /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance/_versions/1.manifest: Function not implemented (os error 38), /home/runner/work/lance/lance/rust/lance-table/src/io/commit.rs:692:54
```
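os error 38 is ENOSYS ("Function not implemented"), which suggests the Google Drive FUSE mount does not support a file operation lance needs when committing the dataset. A possible workaround (a sketch; the paths are taken from the traceback, and running from local disk is an assumption, not a confirmed fix):

```bash
# Copy the run off the Drive mount so lancedb writes to Colab's local disk,
# which supports the copy/rename operations lance uses to commit manifests.
cp -r /content/drive/MyDrive/OrangePro/runs/2024-07-10 /content/run-local
cd /content/run-local
python -m graphrag.query --root . --method local "your question"
```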
@jgbradley1 the issue is still only partially fixed; several artifacts are still not produced, and the settings file is being, at least partly, ignored. Could there be some issue with, say, whitespace making the yaml malformed? Just guessing now.
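One way to test the whitespace guess (a sketch, assuming PyYAML is available): load the file and print the top-level keys, since mis-indented YAML usually parses without error but silently nests keys under the wrong parent:

```bash
# Print the top-level sections the parser actually sees.
python -c "import yaml; print(sorted(yaml.safe_load(open('settings.yaml'))))"
```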
As posted above, even with `embed_graph`, `umap`, and all three snapshots enabled, the `embed_graph`, `graphml`, `raw_entities`, `umap`, and `top_level_nodes` artifacts are still not being generated.
This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.
Having a similar issue; the default settings.yaml works fine. I tried prompt tuning and put the tuned prompts inside a prompts_tuned folder. I copied settings.yaml to settings_prompts_tuned.yaml and updated all the prompt, cache, and output paths. When I index, there are two issues: 1. empty workflow; 2. indexing-engine.log is still generated inside the output folder instead of output_prompts_tuned, while logs.json is generated inside output_prompts_tuned.
After a bit of debugging, I found that `--config` and `--overlay-defaults` have to be used together; using only `--config` causes the empty-workflow issue. Also, the indexing-engine.log path is hard-coded to the output folder in the `_enable_logging()` function. My experiment is based on commit c749fe2.
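Given that finding, the invocation that avoids the empty-workflow issue on that commit would presumably be (a sketch; the settings filename is taken from the comment above):

```bash
# Pass --config together with --overlay-defaults so the default workflow
# steps are merged in instead of the workflow list coming up empty.
python -m graphrag.index --root . --config settings_prompts_tuned.yaml --overlay-defaults
```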