"Columns must be same length as key"
Describe the bug
[Bug]:
Steps to reproduce
No response
Expected Behavior
No response
GraphRAG Config Used
```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: gemma2
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: https://localhost:11434/v1
  api_version: 2024-02-15-preview
  organization: <organization_id>
  deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: lm-studio
    type: openai_embedding # or azure_openai_embedding
    model: Publisher/Repository/nomic-embed-text-v1.5.Q5_K_M.gguf
    api_base: http://localhost:1234/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```
Logs and screenshots
No response
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:
This is a temporary hacked solution for Ollama: https://github.com/s106916/graphrag
I used the same configuration as yours, but hit the same bug:

```sh
pip install graphrag
mkdir -p ./ragtest/input
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
python -m graphrag.index --init --root ./ragtest
# edit .env and settings.yaml to match the content above
python -m graphrag.index --root ./ragtest
```

```
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
```

(This ValueError is raised by pandas when a multi-column DataFrame assignment receives data whose width does not match the number of column keys.)
Sorry, can you give it another try? Please note that settings.yaml has been altered; use the updated settings.yaml.
This can be solved by adjusting the overlap attribute in the chunks settings.
how to adjust?
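For reference, "adjusting the overlap" means editing the chunks section of settings.yaml. A minimal sketch, with illustrative values (not a verified fix):

```yaml
chunks:
  size: 300              # tokens per chunk
  overlap: 50            # reduced from the default of 100
  group_by_columns: [id] # by default, chunks don't cross documents
```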
same problem
Using Xinference for models will solve this problem.
what is Xinference and how to use it?
Refer to: https://inference.readthedocs.io/en/latest/. Xinference is a large language model inference framework that supports both LLM and embedding models and exposes an OpenAI-compatible interface.
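As a sketch of what that might look like here: Xinference serves an OpenAI-compatible API (by default at http://127.0.0.1:9997/v1), so the llm and embeddings sections of settings.yaml could point at it. The model names below are placeholders for whatever models you have launched in Xinference, and the api_key is a dummy value that a local server typically ignores:

```yaml
llm:
  api_key: xinference                        # placeholder; not validated locally
  type: openai_chat
  model: <chat_model_launched_in_xinference> # placeholder
  api_base: http://127.0.0.1:9997/v1         # Xinference's default endpoint

embeddings:
  llm:
    api_key: xinference
    type: openai_embedding
    model: <embedding_model_launched_in_xinference> # placeholder
    api_base: http://127.0.0.1:9997/v1
```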
Hi! We are consolidating alternate model issues here: https://github.com/microsoft/graphrag/issues/657
I have tried, setting overlap to 0, 10, 50, ...:

```yaml
chunks:
  size: 300  # 300
  overlap: 0 # 100
  group_by_columns: [id]
```

And it did not work.