Performance issue with GraphRAG on large file processing (~7GB) – slow load time and verb function not being triggered
I am using GraphRAG to process a large file (~7GB). While processing works fine for smaller files (in the MB range), the workflow experiences significant delays with the larger file: it takes a long time to load, and after more than an hour the workflow still hasn't reached the verb's execution.
Here are the details of the issue:
Small File Processing:
- Small files load quickly and the verb functions are called as expected.
Large File Processing:
- Loading the ~7GB file takes a very long time, and after an hour of waiting the verb function (nomic_embed) has still not been called.
System specs:
- I am using a machine with 128 GB of RAM.
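One mitigation I am considering for the slow load is streaming the input in chunks instead of reading the whole 7GB in one pass, so memory use stays bounded and downstream steps can start sooner. A minimal sketch, assuming the input is a single large CSV (the file format and chunk size here are illustrative assumptions on my part):

import pandas as pd

def iter_input_chunks(path: str, chunk_size: int = 150_000):
    # pd.read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk of rows is resident in memory at a time.
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        yield chunk

Each chunk could then be fed through the pipeline independently, which would also make progress visible well before the full file has been read.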
Although the verb function has not yet been called for the large file, I would also like to ask about optimizing performance for large-file processing. Here is the relevant code snippet I am using:
import logging
from enum import Enum
from typing import Any, cast

import pandas as pd
import io

from datashaper import (
    AsyncType,
    TableContainer,
    VerbCallbacks,
    VerbInput,
    derive_from_rows,
    verb,
)

from graphrag.index.bootstrap import bootstrap
from graphrag.index.cache import PipelineCache
from graphrag.index.storage import PipelineStorage
from graphrag.index.llm import load_llm
from graphrag.llm import CompletionLLM
from graphrag.config.enums import LLMType


@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
) -> TableContainer:
    ...  # body omitted from this snippet
I am using the num_threads and batch_size parameters to parallelize the nomic_embed verb and reduce the processing time for large files.
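For reference, this is roughly how I wire those parameters into the verb body. The embed_texts helper is a hypothetical stand-in for the actual nomic embedding call, and I am assuming the derive_from_rows signature from the datashaper version I have installed (positional table, transform, and callbacks, plus scheduling_type and num_threads keywords), so treat this as a sketch rather than the exact implementation:

# Illustrative verb body (reuses the imports from the snippet above).
async def embed_texts(texts: list[str]) -> list[list[float]]:
    # Hypothetical stand-in for the real nomic embedding call.
    return [[0.0] * 768 for _ in texts]

async def _nomic_embed_body(
    input: VerbInput,
    callbacks: VerbCallbacks,
    column: str,
    to: str,
    async_mode: AsyncType,
    num_threads: int,
    batch_size: int,
    output_file: str,
) -> TableContainer:
    table = cast(pd.DataFrame, input.get_input())

    async def embed_row(row: pd.Series) -> list[float]:
        # Embed a single row's text via the (hypothetical) helper above.
        return (await embed_texts([str(row[column])]))[0]

    frames = []
    # Process the table in slices of batch_size rows; within each slice,
    # derive_from_rows fans the rows out across num_threads workers.
    for start in range(0, len(table), batch_size):
        batch = table.iloc[start : start + batch_size].copy()
        results = await derive_from_rows(
            batch,
            embed_row,
            callbacks,
            scheduling_type=async_mode,
            num_threads=num_threads,
        )
        batch[to] = results
        frames.append(batch)

    out = pd.concat(frames)
    out.to_parquet(output_file)  # persist the combined results to disk
    return TableContainer(table=out)

Writing each batch to its own parquet file instead of one combined file would additionally allow resuming after a failure rather than redoing the whole 7GB.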
Is there a recommended approach, or are there additional parameters I should consider, for processing large files with GraphRAG?