[Bug]: Does GraphRAG have a file size limit for uploads?

Open hope12122 opened this issue 11 months ago • 0 comments

Do you need to file an issue?

[x] I have searched the existing issues and this bug is not already filed.
[x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

Small files upload and index without problems. When I upload this book, which is fairly large (1,088 KB), the upload appears to succeed and GraphRAG finds the file, but the file then fails to load.

Steps to reproduce

Upload a larger book and run indexing. For testing, I'm using a 1,088 KB file with 585,696 characters, encoded in UTF-8.
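The log below reports `Error loading file xizang_history.txt. Skipping...` even though the file is expected to be UTF-8, so one quick check is whether every byte really decodes as UTF-8. This is a minimal diagnostic sketch, not part of GraphRAG; the function name `find_decode_error` is hypothetical, and a decoding failure is only one possible cause of the load error.

```python
from pathlib import Path


def find_decode_error(path, encoding="utf-8"):
    """Return None if the file decodes cleanly, else (byte_offset, reason).

    Hypothetical helper for illustration; it only tests decodability and
    says nothing about how GraphRAG itself reads the file.
    """
    data = Path(path).read_bytes()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that failed to decode.
        return exc.start, exc.reason
    return None
```

Calling `find_decode_error("input/xizang_history.txt")` from the project root returns `None` when the file is valid UTF-8; otherwise the byte offset shows where to look in the file.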

Expected Behavior

GraphRAG should be able to load and index a large book.

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  #type: file # or blob
  #base_dir: "vv"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
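For a sense of scale, the `chunks` settings in the config above (`size: 1200`, `overlap: 100`) can be turned into a rough chunk count. The sketch below assumes a plain sliding window over tokens that advances by `size - overlap` each step; this is an illustrative assumption, not a confirmed description of GraphRAG's chunker, and `estimate_chunk_count` is a hypothetical name. The book's token count is not given in this issue, so any value passed in is a guess.

```python
import math


def estimate_chunk_count(total_tokens, size=1200, overlap=100):
    """Estimate how many overlapping windows cover `total_tokens` tokens.

    Assumes windows start every (size - overlap) tokens; this is a sketch
    of a generic sliding window, not GraphRAG's actual implementation.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= size:
        return 1
    stride = size - overlap
    # One window for the first `size` tokens, then one per stride after that.
    return 1 + math.ceil((total_tokens - size) / stride)
```

Under this assumption, even a book of several hundred thousand tokens yields only a few hundred chunks, which suggests the failure is more likely in file loading than in chunking.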

Logs and screenshots

10:26:28,71 graphrag.index.create_pipeline_config INFO skipping workflows
10:26:28,71 graphrag.index.run.run INFO Running pipeline
10:26:28,72 graphrag.storage.file_pipeline_storage INFO Creating file storage at /home/emotionalrag/graphrag/incremental_ragtest/output
10:26:28,72 graphrag.index.input.factory INFO loading input from root_dir=input
10:26:28,72 graphrag.index.input.factory INFO using file storage for input
10:26:28,73 graphrag.storage.file_pipeline_storage INFO search /home/emotionalrag/graphrag/incremental_ragtest/input for files matching .*.txt$
10:26:28,74 graphrag.index.input.text INFO found text files from input, found [('xizang_history.txt', {})]
10:26:28,80 graphrag.index.input.text WARNING Warning! Error loading file xizang_history.txt. Skipping...
10:26:28,80 graphrag.index.input.text INFO Found 1 files, loading 0
10:26:28,82 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_final_documents', 'extract_graph', 'compute_communities', 'create_final_entities', 'create_final_relationships', 'create_final_communities', 'create_final_nodes', 'create_final_text_units', 'create_final_community_reports', 'generate_text_embeddings']
10:26:28,82 graphrag.index.run.run INFO Final # of rows loaded: 0
10:26:28,238 graphrag.index.run.workflow INFO dependencies for create_base_text_units: []
10:26:28,243 datashaper.workflow.workflow INFO executing verb create_base_text_units
10:26:28,243 datashaper.workflow.workflow ERROR Error executing verb "create_base_text_units" in create_base_text_units: 'id'
Traceback (most recent call last):
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
    result = await result
             ^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
    output = await create_base_text_units(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
    sort = documents.sort_values(by=["id"], ascending=[True])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
    k = self._get_label_or_level_values(by[0], axis=axis)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'id'
10:26:28,248 graphrag.callbacks.file_workflow_callbacks INFO Error executing verb "create_base_text_units" in create_base_text_units: 'id' details=None
10:26:28,248 graphrag.index.run.run ERROR error running workflow create_base_text_units
Traceback (most recent call last):
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/run.py", line 262, in run_pipeline
    result = await _process_workflow(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/workflow.py", line 103, in _process_workflow
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
    result = await result
             ^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
    output = await create_base_text_units(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
    sort = documents.sort_values(by=["id"], ascending=[True])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
    k = self._get_label_or_level_values(by[0], axis=axis)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'id'
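The `KeyError: 'id'` appears to be a downstream symptom: the log shows the only input file being skipped (`Found 1 files, loading 0`), so the documents table reaching `create_base_text_units` is empty and has no `id` column to sort by. The sketch below scans a log for that pattern so the real failure is not hidden by the traceback. It is a hypothetical helper (`summarize_input_loading` is not a GraphRAG function), and the regular expressions are based only on the log lines quoted in this issue.

```python
import re

# Patterns taken from the log lines quoted in this issue (an assumption:
# other GraphRAG versions may word these messages differently).
SKIP_RE = re.compile(r"Error loading file (?P<name>\S+)\. Skipping")
COUNT_RE = re.compile(r"Found (?P<found>\d+) files, loading (?P<loaded>\d+)")


def summarize_input_loading(log_text):
    """Report which input files were skipped and the found/loaded counts."""
    skipped = SKIP_RE.findall(log_text)
    match = COUNT_RE.search(log_text)
    if match:
        found, loaded = int(match["found"]), int(match["loaded"])
    else:
        found, loaded = None, None
    return {"skipped": skipped, "found": found, "loaded": loaded}
```

If `loaded` is smaller than `found`, the files listed under `skipped` are the ones to inspect before looking at later workflow errors.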

Additional Information

GraphRAG Version: 1.0.0
Operating System: Ubuntu
Python Version: 3.12
Related Issues:

Jan 15 '25 02:01 hope12122
