
[Bug]: Does GraphRAG have a file size limit for uploads?

Open hope12122 opened this issue 11 months ago • 0 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

I have no problem indexing small files, but with this fairly large book (1088 KB) the upload appears to succeed and GraphRAG can find the file, yet it fails to load it.

Steps to reproduce

Index a larger book. For testing I'm using a 1088 KB file with 585,696 characters, encoded in UTF-8.
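For context, a quick way to confirm the source file itself decodes cleanly as UTF-8 (ruling out an encoding problem rather than a size limit). This is my own diagnostic sketch, not part of GraphRAG; the path assumes the layout shown in the logs below (input/xizang_history.txt):

# Sanity check: can the book be read as UTF-8 at all?
from pathlib import Path

path = Path("input/xizang_history.txt")
raw = path.read_bytes()
print(f"size on disk: {len(raw)} bytes")

try:
    text = raw.decode("utf-8")
    print(f"decoded OK: {len(text)} characters")
except UnicodeDecodeError as err:
    # A single bad byte is enough for a strict loader to skip the whole file.
    print(f"decode failed at byte {err.start}: {err.reason}")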

Expected Behavior

Indexing of large books should be supported.

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running graphrag index with custom settings
## we normally use graphrag update with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "vv"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
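For what it's worth, the chunks settings above only control how the text is windowed after it has been loaded, so a large book should simply produce more chunks rather than hit a size cap. Below is my own rough illustration of that token-window idea using tiktoken with the values configured above; it is not GraphRAG's internal chunking code:

# Rough sketch (assumption, not GraphRAG's implementation) of what
# chunks.size=1200 and chunks.overlap=100 imply for a long document.
import tiktoken

def chunk_text(text: str, size: int = 1200, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # matches encoding_model above
    tokens = enc.encode(text)
    step = size - overlap
    # A 585,696-character book just becomes a longer list of windows.
    return [enc.decode(tokens[i : i + size]) for i in range(0, len(tokens), step)]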

Logs and screenshots


10:26:28,71 graphrag.index.create_pipeline_config INFO skipping workflows
10:26:28,71 graphrag.index.run.run INFO Running pipeline
10:26:28,72 graphrag.storage.file_pipeline_storage INFO Creating file storage at /home/emotionalrag/graphrag/incremental_ragtest/output
10:26:28,72 graphrag.index.input.factory INFO loading input from root_dir=input
10:26:28,72 graphrag.index.input.factory INFO using file storage for input
10:26:28,73 graphrag.storage.file_pipeline_storage INFO search /home/emotionalrag/graphrag/incremental_ragtest/input for files matching .*.txt$
10:26:28,74 graphrag.index.input.text INFO found text files from input, found [('xizang_history.txt', {})]
10:26:28,80 graphrag.index.input.text WARNING Warning! Error loading file xizang_history.txt. Skipping...
10:26:28,80 graphrag.index.input.text INFO Found 1 files, loading 0
10:26:28,82 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_final_documents', 'extract_graph', 'compute_communities', 'create_final_entities', 'create_final_relationships', 'create_final_communities', 'create_final_nodes', 'create_final_text_units', 'create_final_community_reports', 'generate_text_embeddings']
10:26:28,82 graphrag.index.run.run INFO Final # of rows loaded: 0
10:26:28,238 graphrag.index.run.workflow INFO dependencies for create_base_text_units: []
10:26:28,243 datashaper.workflow.workflow INFO executing verb create_base_text_units
10:26:28,243 datashaper.workflow.workflow ERROR Error executing verb "create_base_text_units" in create_base_text_units: 'id'
Traceback (most recent call last):
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
    result = await result
             ^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
    output = await create_base_text_units(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
    sort = documents.sort_values(by=["id"], ascending=[True])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
    k = self._get_label_or_level_values(by[0], axis=axis)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'id'
10:26:28,248 graphrag.callbacks.file_workflow_callbacks INFO Error executing verb "create_base_text_units" in create_base_text_units: 'id' details=None
10:26:28,248 graphrag.index.run.run ERROR error running workflow create_base_text_units
Traceback (most recent call last):
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/run.py", line 262, in run_pipeline
    result = await _process_workflow(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/workflow.py", line 103, in _process_workflow
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
    result = await result
             ^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
    output = await create_base_text_units(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
    sort = documents.sort_values(by=["id"], ascending=[True])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
    k = self._get_label_or_level_values(by[0], axis=axis)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'id'
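Reading the trace, the KeyError: 'id' looks like a downstream symptom of the earlier "Error loading file xizang_history.txt. Skipping..." warning rather than a separate bug: with "Final # of rows loaded: 0", the documents frame passed to create_base_text_units has no id column to sort by. A minimal pandas reproduction of that secondary failure (my assumption about the empty frame, not verified against the loader):

# If every input file is skipped, the loader yields an empty frame,
# and sorting by a missing 'id' column fails exactly as in the log.
import pandas as pd

documents = pd.DataFrame()                           # 0 files loaded -> no columns
documents.sort_values(by=["id"], ascending=[True])   # raises KeyError: 'id'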

Additional Information

  • GraphRAG Version: 1.0.0
  • Operating System: Ubuntu
  • Python Version: 3.12
  • Related Issues:

hope12122 · Jan 15 '25 02:01