"Columns must be same length as key"
Describe the bug
[Bug]:
Steps to reproduce
No response
Expected Behavior
No response
GraphRAG Config Used
```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: gemma2
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: https://localhost:11434/v1
  api_version: 2024-02-15-preview
  organization: <organization_id>
  deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: lm-studio
    type: openai_embedding # or azure_openai_embedding
    model: Publisher/Repository/nomic-embed-text-v1.5.Q5_K_M.gguf
    api_base: http://localhost:1234/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```
Logs and screenshots
No response
Additional Information
- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:
This is a temporary hacked solution for Ollama: https://github.com/s106916/graphrag
I used the same configuration as yours, but hit the same bug:

```sh
pip install graphrag
mkdir -p ./ragtest/input
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
python -m graphrag.index --init --root ./ragtest
# edit .env and settings.yaml to match the content above
python -m graphrag.index --root ./ragtest
```

```
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
```

(This ValueError is raised by pandas when a multi-column DataFrame assignment receives data whose width does not match the number of column keys.)
Sorry, can you give it another try? Please note that settings.yaml has been altered; use the updated settings.yaml.
This can be solved by adjusting the overlap attribute in the chunks settings.
how to adjust?
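For reference, "adjusting the overlap" means editing the chunks section of settings.yaml. A minimal sketch, with illustrative values (not a verified fix):

```yaml
chunks:
  size: 300              # tokens per chunk
  overlap: 50            # reduced from the default of 100
  group_by_columns: [id] # by default, chunks don't cross documents
```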
same problem
Using Xinference for models will solve this problem.
what is Xinference and how to use it?
Refer to: https://inference.readthedocs.io/en/latest/. Xinference is a large language model inference framework that supports both LLM and embedding models and exposes an OpenAI-compatible interface.
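As a sketch of what that might look like here: Xinference serves an OpenAI-compatible API (by default at http://127.0.0.1:9997/v1), so the llm and embeddings sections of settings.yaml could point at it. The model names below are placeholders for whatever models you have launched in Xinference, and the api_key is a dummy value that a local server typically ignores:

```yaml
llm:
  api_key: xinference                        # placeholder; not validated locally
  type: openai_chat
  model: <chat_model_launched_in_xinference> # placeholder
  api_base: http://127.0.0.1:9997/v1         # Xinference's default endpoint

embeddings:
  llm:
    api_key: xinference
    type: openai_embedding
    model: <embedding_model_launched_in_xinference> # placeholder
    api_base: http://127.0.0.1:9997/v1
```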
Hi! We are consolidating alternate model issues here: https://github.com/microsoft/graphrag/issues/657
I have tried, setting overlap to 0, 10, 50, ...:

```yaml
chunks:
  size: 300  # 300
  overlap: 0 # 100
  group_by_columns: [id]
```

And it did not work.