[Bug]: PyArrow Capacity Limit
Do you need to file an issue?
- [X] I have searched the existing issues and this bug is not already filed.
- [X] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
At the end of the "create_base_extracted_entities" workflow, GraphRAG stores a large GraphML file as a string inside a pandas DataFrame. Later, when it tries to save that DataFrame as a parquet file, the following exception is triggered:
- pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2612524437
This is a limitation of PyArrow: by default, a single string array cannot hold more than 2 GB of data. The PyArrow sources recommend using pyarrow.large_string for such cases, but further research is needed.
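A possible mitigation, sketched below, is to convert the DataFrame to an Arrow table with a schema that upgrades 32-bit string columns to pyarrow.large_string (64-bit offsets) before writing parquet. This is only a sketch under assumptions: the helper name write_parquet_large_strings is invented here, it has not been tested inside the GraphRAG pipeline, and the Parquet format may impose its own limits on very large individual values; it only targets the ArrowCapacityError raised during Table.from_pandas.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_large_strings(df: pd.DataFrame, path: str) -> None:
    # Derive the schema pandas would normally use, then swap every 32-bit
    # string field for its 64-bit-offset (large_string) counterpart.
    schema = pa.Schema.from_pandas(df, preserve_index=False)
    for i, field in enumerate(schema):
        if pa.types.is_string(field.type):
            schema = schema.set(i, field.with_type(pa.large_string()))
    # Converting with the patched schema avoids the 2 GiB offset overflow
    # that Table.from_pandas hits with the default string type.
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, path)

If this pans out, the conversion would presumably need to be applied by the step in GraphRAG that writes the workflow output to parquet rather than by users.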
Steps to reproduce
Option 1:
Create an index using a dataset that contains 160 million tokens.
Option 2:
Create a DataFrame with one row that contains a string larger than 2 GB. Below is a script that reproduces the problem:
import string
import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import display  # the snippet was run in a notebook

# Number of doublings needed to grow the 26-byte seed past ~3 GB
rows = int(3 * 2**30 / len(string.ascii_lowercase.encode('utf-8')))
rows = int(np.ceil(np.log2(rows)))  # exponential growth

row_data = string.ascii_lowercase
for _ in tqdm(range(rows)):
    row_data += row_data

df = pd.DataFrame(
    {'A': [row_data]}
)

# Compute the size of the DataFrame in bytes
df_memory_usage = df.memory_usage(deep=True).sum()
df_memory_usage_mb = df_memory_usage / 2**20  # Convert bytes to megabytes
print(f'Total memory usage of the DataFrame is: {df_memory_usage_mb:.2f} MB')
print(f"df.dtypes: {df.dtypes}")
display(df.head())

# This line will raise a PyArrow capacity error
df.to_parquet("max_capacity.parquet")
Expected Behavior
If you follow option 1, you should see the error message below at the end of the "create_base_extracted_entities" workflow:
File "pyarrow/array.pxi", line 339, in pyarrow.lib.array File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_statuspyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 261264517716:37:55,764
graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
If you execute option 2, you will see the following error:
ArrowCapacityError Traceback (most recent call last)
Cell In[5], line 24
21 display(df.head())
23 # This line will raise a PyArrow capacity error
---> 24 df.to_parquet(\"max_capacity.parquet\")
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/util/_decorators.py:333, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
327 if len(args) > num_allow_args:
328 warnings.warn(
329 msg.format(arguments=_format_argument_list(allow_args)),
330 FutureWarning,
331 stacklevel=find_stack_level(),
332 )
--> 333 return func(*args, **kwargs)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/core/frame.py:3113, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
3032 \"\"\"
3033 Write a DataFrame to the binary parquet format.
3034
(...)
3109 >>> content = f.read()
3110 \"\"\"
3111 from pandas.io.parquet import to_parquet
-> 3113 return to_parquet(
3114 self,
3115 path,
3116 engine,
3117 compression=compression,
3118 index=index,
3119 partition_cols=partition_cols,
3120 storage_options=storage_options,
3121 **kwargs,
3122 )
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:480, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, filesystem, **kwargs)
476 impl = get_engine(engine)
478 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 480 impl.write(
481 df,
482 path_or_buf,
483 compression=compression,
484 index=index,
485 partition_cols=partition_cols,
486 storage_options=storage_options,
487 filesystem=filesystem,
488 **kwargs,
489 )
491 if path is None:
492 assert isinstance(path_or_buf, io.BytesIO)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py:190, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, filesystem, **kwargs)
187 if index is not None:
188 from_pandas_kwargs[\"preserve_index\"] = index
--> 190 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
192 if df.attrs:
193 df_metadata = {"PANDAS_ATTRS": json.dumps(df.attrs)}
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/table.pxi:3874, in pyarrow.lib.Table.from_pandas()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
606 return (isinstance(arr, np.ndarray) and
607 arr.flags.contiguous and
608 issubclass(arr.dtype.type, np.integer))
610 if nthreads == 1:
--> 611 arrays = [convert_column(c, f)
612 for c, f in zip(columns_to_convert, convert_fields)]
613 else:
614 arrays = []
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:611, in <listcomp>(.0)
606 return (isinstance(arr, np.ndarray) and
607 arr.flags.contiguous and
608 issubclass(arr.dtype.type, np.integer))
610 if nthreads == 1:
--> 611 arrays = [convert_column(c, f)
612 for c, f in zip(columns_to_convert, convert_fields)]
613 else:
614 arrays = []
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py:592, in dataframe_to_arrays.<locals>.convert_column(col, field)
589 type_ = field.type
591 try:
--> 592 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
593 except (pa.ArrowInvalid,
594 pa.ArrowNotImplementedError,
595 pa.ArrowTypeError) as e:
596 e.args += (\"Conversion failed for column {!s} with type {!s}\"
597 .format(col.name, col.dtype),)
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:339, in pyarrow.lib.array()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/array.pxi:85, in pyarrow.lib._ndarray_to_array()
File ~/anaconda3/envs/graphrag/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 3489660928
GraphRAG Config Used
# Define anchors to be reused
openai_api_key: &openai_api_key ${OPENAI_API_KEY}
#######################
# pipeline parameters #
#######################
# data inputs
input:
  type: file
  file_type: text
  file_pattern: .*\.txt$
  base_dir: ./data
# tokenizer model name
encoding_model: &encoding_name o200k_base # gpt-4o
# encoding_model: &encoding_name cl100k_base # gpt-4-turbo
# text chunking
chunks:
  size: &chunk_size 800 # 800 tokens (about 3200 characters)
  overlap: &chunk_overlap 100 # 100 tokens (about 400 characters)
  strategy:
    type: tokens
    chunk_size: *chunk_size
    chunk_overlap: *chunk_overlap
    encoding_name: *encoding_name
# chat llm inputs
llm: &chat_llm
  api_key: *openai_api_key
  type: openai_chat
  model: gpt-4o-mini
  max_tokens: 4096
  request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
  api_version: "2024-02-01"
  # deployment_name: gpt-4o-mini
  model_supports_json: true
  tokens_per_minute: 150000000
  requests_per_minute: 30000
  max_retries: 20
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true
  concurrent_requests: 50
parallelization: &parallelization
  stagger: 0.25
  num_threads: 100
async_mode: &async_mode asyncio
# async_mode: &async_mode threaded
entity_extraction:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/entity_extraction.txt
  max_gleanings: 1
summarize_descriptions:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/summarize_descriptions.txt
  max_length: 500
community_reports:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: ./prompts/community_report.txt
  max_length: &max_report_length 2000
  max_input_length: 8000
# embeddings llm inputs
embeddings:
  llm:
    api_key: *openai_api_key
    type: openai_embedding
    model: text-embedding-ada-002
    request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
    api_version: "2024-02-01"
    # deployment_name: text-embedding-ada-002
    model_supports_json: false
    tokens_per_minute: 10000000
    requests_per_minute: 10000
    max_retries: 20
    max_retry_wait: 10
    sleep_on_rate_limit_recommendation: true
    concurrent_requests: 50
  parallelization: *parallelization
  async_mode: *async_mode
  batch_size: 16
  batch_max_tokens: 8191
  vector_store:
    type: lancedb
    overwrite: true
    db_uri: ./index/storage/lancedb
    query_collection_name: entity_description_embeddings
cache:
  type: file
  base_dir: ./index/cache
storage:
  type: file
  base_dir: ./index/storage
reporting:
  type: file
  base_dir: ./index/reporting
snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true
#####################################
# orchestration (query) definitions #
#####################################
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  llm_max_tokens: 2000
global_search:
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 50
Logs and screenshots
No response
Additional Information
- GraphRAG Version: 0.3.0
- Operating System: 22.04.1-Ubuntu
- Python Version: 3.11.5
- Related Issues: N/A