[Bug]: PyArrow Capacity Limit
Do you need to file an issue?
- [X] I have searched the existing issues and this bug is not already filed.
- [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
At the end of the "create_base_extracted_entities" workflow, GraphRAG stores a large GraphML file as a string inside a pandas DataFrame. Later, when it tries to save that DataFrame as a parquet file, the following exception is triggered:
- pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2612524437
This is a limitation of PyArrow: by default, a single string array cannot hold more than 2 GB of data. The PyArrow sources recommend using pyarrow.large_string for such cases, but further research is needed.
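A possible mitigation, sketched below, is to convert the DataFrame to an Arrow table with a schema that upgrades 32-bit string columns to pyarrow.large_string (64-bit offsets) before writing parquet. This is only a sketch under assumptions: the helper name write_parquet_large_strings is invented here, it has not been tested inside the GraphRAG pipeline, and the Parquet format may impose its own limits on very large individual values; it only targets the ArrowCapacityError raised during Table.from_pandas.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_large_strings(df: pd.DataFrame, path: str) -> None:
    # Derive the schema pandas would normally use, then swap every 32-bit
    # string field for its 64-bit-offset (large_string) counterpart.
    schema = pa.Schema.from_pandas(df, preserve_index=False)
    for i, field in enumerate(schema):
        if pa.types.is_string(field.type):
            schema = schema.set(i, field.with_type(pa.large_string()))
    # Converting with the patched schema avoids the 2 GiB offset overflow
    # that Table.from_pandas hits with the default string type.
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, path)

If this pans out, the conversion would presumably need to be applied by the step in GraphRAG that writes the workflow output to parquet rather than by users.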
Steps to reproduce
Option 1:
Create an index using a dataset that contains 160 million tokens.
Option 2:
Create a DataFrame with one row that contains a string larger than 2 GB. Below is a script that reproduces the problem:
import string
import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import display  # the snippet was run in a notebook

# Number of doublings needed to grow the 26-byte seed past ~3 GB
rows = int(3 * 2**30 / len(string.ascii_lowercase.encode('utf-8')))
rows = int(np.ceil(np.log2(rows)))  # exponential growth

row_data = string.ascii_lowercase
for _ in tqdm(range(rows)):
    row_data += row_data

df = pd.DataFrame(
    {'A': [row_data]}
)

# Compute the size of the DataFrame in bytes
df_memory_usage = df.memory_usage(deep=True).sum()
df_memory_usage_mb = df_memory_usage / 2**20  # Convert bytes to megabytes
print(f'Total memory usage of the DataFrame is: {df_memory_usage_mb:.2f} MB')
print(f"df.dtypes: {df.dtypes}")
display(df.head())

# This line will raise a PyArrow capacity error
df.to_parquet("max_capacity.parquet")
Expected Behavior
If you follow option 1, you should see the error message below at the end of the "create_base_extracted_entities" workflow:
File "pyarrow/array.pxi", line 339, in pyarrow.lib.array File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_statuspyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 261264517716:37:55,764
graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
If you execute option 2, you will see the following error:
ArrowCapacityError Traceback (most recent call last)
Cell In[5], line 24
21 display(df.head())
23 # This line will raise a PyArrow capacity error
---> 24 df.to_parquet(\"max_capacity.parquet\")
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/util/_decorators.py:333, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
327 if len(args) > num_allow_args:
328 warnings.warn(
329 msg.format(arguments=_format_argument_list(allow_args)),
330 FutureWarning,
331 stacklevel=find_stack_level(),
332 )
--> 333 return func(*args, **kwargs)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/core/frame.py:3113, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
3032 \"\"\"
3033 Write a DataFrame to the binary parquet format.
3034
(...)
3109 >>> content = f.read()
3110 \"\"\"
3111 from pandas.io.parquet import to_parquet
-> 3113 return to_parquet(
3114 self,
3115 path,
3116 engine,
3117 compression=compression,
3118 index=index,
3119 partition_cols=partition_cols,
3120 storage_options=storage_options,
3121 **kwargs,
3122 )
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:480, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, filesystem, **kwargs)
476 impl = get_engine(engine)
478 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 480 impl.write(
481 df,
482 path_or_buf,
483 compression=compression,
484 index=index,
485 partition_cols=partition_cols,
486 storage_options=storage_options,
487 filesystem=filesystem,
488 **kwargs,
489 )
491 if path is None:
492 assert isinstance(path_or_buf, io.BytesIO)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:190, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, filesystem, **kwargs)
187 if index is not None:
188 from_pandas_kwargs[\"preserve_index\"] = index
--> 190 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
192 if df.attrs:
193 df_metadata = {"PANDAS_ATTRS": json.dumps(df.attrs)}
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/table.pxi:3874, in pyarrow.lib.Table.from_pandas()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
606 return (isinstance(arr, np.ndarray) and
607 arr.flags.contiguous and
608 issubclass(arr.dtype.type, np.integer))
610 if nthreads == 1:
--> 611 arrays = [convert_column(c, f)
612 for c, f in zip(columns_to_convert, convert_fields)]
613 else:
614 arrays = []
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in <listcomp>(.0)
606 return (isinstance(arr, np.ndarray) and
607 arr.flags.contiguous and
608 issubclass(arr.dtype.type, np.integer))
610 if nthreads == 1:
--> 611 arrays = [convert_column(c, f)
612 for c, f in zip(columns_to_convert, convert_fields)]
613 else:
614 arrays = []
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:592, in dataframe_to_arrays.<locals>.convert_column(col, field)
589 type_ = field.type
591 try:
--> 592 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
593 except (pa.ArrowInvalid,
594 pa.ArrowNotImplementedError,
595 pa.ArrowTypeError) as e:
596 e.args += (\"Conversion failed for column {!s} with type {!s}\"
597 .format(col.name, col.dtype),)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:339, in pyarrow.lib.array()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:85, in pyarrow.lib._ndarray_to_array()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 3489660928
GraphRAG Config Used
# Define anchors to be reused
openai_api_key: &openai_api_key ${OPENAI_API_KEY}
#######################
# pipeline parameters #
#######################
# data inputs
input:
  type: file
  file_type: text
  file_pattern: .*\.txt$
  base_dir: ./data
# tokenizer model name
encoding_model: &encoding_name o200k_base # gpt-4o
# encoding_model: &encoding_name cl100k_base # gpt-4-turbo
# text chunking
chunks:
  size: &chunk_size 800 # 800 tokens (about 3200 characters)
  overlap: &chunk_overlap 100 # 100 tokens (about 400 characters)
  strategy:
    type: tokens
    chunk_size: *chunk_size
    chunk_overlap: *chunk_overlap
    encoding_name: *encoding_name
# chat llm inputs
llm: &chat_llm
  api_key: *openai_api_key
  type: openai_chat
  model: gpt-4o-mini
  max_tokens: 4096
  request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
  api_version: "2024-02-01"
  # deployment_name: gpt-4o-mini
  model_supports_json: true
  tokens_per_minute: 150000000
  requests_per_minute: 30000
  max_retries: 20
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true
  concurrent_requests: 50
parallelization: &parallelization
  stagger: 0.25
  num_threads: 100
async_mode: &async_mode asyncio
# async_mode: &async_mode threaded
entity_extraction:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/entity_extraction.txt
  max_gleanings: 1
summarize_descriptions:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/summarize_descriptions.txt
  max_length: 500
community_reports:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/community_report.txt
  max_length: &max_report_length 2000
  max_input_length: 8000
# embeddings llm inputs
embeddings:
  llm:
    api_key: *openai_api_key
    type: openai_embedding
    model: text-embedding-ada-002
    request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
    api_version: "2024-02-01"
    # deployment_name: text-embedding-ada-002
    model_supports_json: false
    tokens_per_minute: 10000000
    requests_per_minute: 10000
    max_retries: 20
    max_retry_wait: 10
    sleep_on_rate_limit_recommendation: true
    concurrent_requests: 50
  parallelization: *parallelization
  async_mode: *async_mode
  batch_size: 16
  batch_max_tokens: 8191
  vector_store:
    type: lancedb
    overwrite: true
    db_uri: ./index/storage/lancedb
    query_collection_name: entity_description_embeddings
cache:
  type: file
  base_dir: ./index/cache
storage:
  type: file
  base_dir: ./index/storage
reporting:
  type: file
  base_dir: ./index/reporting
snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true
#####################################
# orchestration (query) definitions #
#####################################
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  llm_max_tokens: 2000
global_search:
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 50
Logs and screenshots
No response
Additional Information
- GraphRAG Version: 0.3.0
- Operating System: 22.04.1-Ubuntu
- Python Version: 3.11.5
- Related Issues: N/A