
Performance issue with GraphRag on large file processing (7GB) – slow load time and verb function not being triggered

Open 9prodhi opened this issue 1 year ago • 0 comments

I am using GraphRag to process a large file (~7GB). Processing works fine for smaller files (in the MB range), but the workflow experiences significant delays with the larger file: the file takes a long time to load, and after more than an hour the workflow still hasn't reached the verb's execution.

Here are the details of the issue:

Small File Processing:

  • Small files load quickly and the verb functions are called as expected.

Large File Processing:

  • Loading a ~7GB file takes a very long time, and after one hour of waiting, the verb function (nomic_embed) has not been called.

System specs:

  • I am using a machine with 128 GB of RAM.

Beyond the verb not being called for the larger file, I would also like to ask about optimizing performance for large file processing. Here is the relevant code snippet I am using:

import logging
from enum import Enum
from typing import Any, cast
import pandas as pd
import io
from datashaper import (
    AsyncType,
    TableContainer,
    VerbCallbacks,
    VerbInput,
    derive_from_rows,
    verb,
)
from graphrag.index.bootstrap import bootstrap
from graphrag.index.cache import PipelineCache
from graphrag.index.storage import PipelineStorage
from graphrag.index.llm import load_llm
from graphrag.llm import CompletionLLM
from graphrag.config.enums import LLMType

@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
):
    # Verb body omitted here; a sketch of the intended body follows below.
    ...

I am using the num_threads and batch_size parameters to parallelize the nomic_embed verb and reduce the processing time for large files.
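For context, here is a rough sketch of how I intend the verb body (inside nomic_embed above) to use these parameters. The embed_text helper is a placeholder for the actual Nomic embedding call, which is not shown in this issue, and the derive_from_rows usage follows the pattern of the built-in graphrag verbs; treat this as an illustration rather than the exact code:

    # Sketch of the intended verb body (reuses the imports above).
    source = cast(pd.DataFrame, input.get_input())

    async def embed_row(row: pd.Series) -> list[float]:
        # embed_text is a placeholder for the actual Nomic embedding client call
        return await embed_text(row[column])

    frames = []
    # Slice the input into batch_size-row chunks so the full ~7GB table is never embedded in one pass
    for start in range(0, len(source), batch_size):
        batch = source.iloc[start : start + batch_size]
        embeddings = await derive_from_rows(
            batch,
            embed_row,
            callbacks,
            scheduling_type=async_mode,
            num_threads=num_threads,
        )
        batch = batch.assign(**{to: embeddings})
        frames.append(batch[[id_column, to]])

    result = pd.concat(frames, ignore_index=True)
    result.to_parquet(output_file)
    return TableContainer(table=result)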

Are there any recommended approaches or additional parameters I should consider for processing large files with GraphRag?

9prodhi · Oct 13 '24 23:10