
Performance Degradation When Materializing LangChain's Document Objects

Open SasCezar opened this issue 9 months ago • 1 comments

Description

I'm experiencing a significant performance degradation when materializing a list of Document objects compared to using their JSON (dictionary) representation. Specifically, processing 200 documents takes roughly 20x longer when the step returns List[Document] objects instead of a list of dictionaries (List[Dict]).

Code

from typing import Annotated, List, Dict
from langchain_core.documents import Document
from zenml import step, get_step_context

@step()
def chunk_docs(docs: List[Document]) -> Annotated[List[Document], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)}
    )
    return docs

@step()
def chunk_docs_dict(docs: List[Document]) -> Annotated[List[Dict], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents as dictionaries.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)}
    )
    docs = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]
    return docs

if __name__ == "__main__":
    num_docs = 200
    docs = [
        Document(
            page_content=f"This is the content of document {i}." * 50,
            metadata={"doc_id": i}
        )
        for i in range(num_docs)
    ]

    # Time both steps: Document objects vs. plain dictionaries
    import time

    start = time.time()
    chunked_docs = chunk_docs(docs)
    print(f"Time taken to chunk {num_docs} docs with Document objects: {time.time() - start}")

    start = time.time()
    chunked_docs = chunk_docs_dict(docs)
    print(f"Time taken to chunk {num_docs} docs with dictionaries: {time.time() - start}")

Output

Running single step pipeline to execute step chunk_docs
...
Received 200 documents. Returning documents without changes.
Step chunk_docs has finished in 33.667s.
Pipeline run has finished in 33.725s.
Time taken to chunk 200 docs with Document objects: 36.364107847213745

Running single step pipeline to execute step chunk_docs_dict
...
Received 200 documents. Returning documents as dictionaries.
Step chunk_docs_dict has finished in 0.440s.
Pipeline run has finished in 0.487s.
Time taken to chunk 200 docs with dictionaries: 1.629422664642334

Expected Behavior

I expected both steps to have similar performance, since they essentially process the same data; yet converting each Document to a dictionary first (as in chunk_docs_dict) turns out to be roughly 20x faster.

Actual Behavior

Using Document objects: ~36.36 seconds for 200 documents. Using dictionary conversion: ~1.63 seconds for 200 documents.

Environment:

langchain 0.3.19
zenml 0.74.0
python 3.11.11

Discussion

Since both steps return a list type, the BuiltInContainerMaterializer is used by default, bypassing the materializer defined in the LangChain integration. For List[Dict], the container materializer takes the fast path via its _is_serializable check and stores the list directly. For List[Document], every item instead triggers a lookup in the materializer registry, which for Document resolves to the PydanticMaterializer, so each element is saved separately.
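To make the cost model concrete, here is a toy illustration (not ZenML internals; both function names are hypothetical) of the difference between saving each list element as its own file, as the per-item dispatch does, and dumping the whole serializable list in one write:

```python
import json
import os
import tempfile

def save_per_item(items, root):
    # One file per element -- n separate open/write/close cycles,
    # analogous to one materializer call per list item.
    for i, item in enumerate(items):
        with open(os.path.join(root, f"item_{i}.json"), "w") as f:
            json.dump(item, f)

def save_as_one(items, root):
    # A single file for the whole list -- one write.
    with open(os.path.join(root, "items.json"), "w") as f:
        json.dump(items, f)

items = [{"page_content": f"doc {i}", "metadata": {"doc_id": i}} for i in range(200)]
with tempfile.TemporaryDirectory() as root:
    save_per_item(items, root)
    save_as_one(items, root)
    print(len(os.listdir(root)))  # -> 201 (200 per-item files + 1 combined file)
```

On a local filesystem the per-item overhead is small, but against a remote artifact store each save can involve its own round trip, which is where the n-fold cost would dominate.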

However, beyond the creation of a separate file for each item in the list, I don't understand why the performance difference is this large.

SasCezar avatar Feb 23 '25 20:02 SasCezar

This is a known behaviour. The built-in materializers for List objects will naively save each object in the list separately, so that's n individual save operations that have to take place. To speed this up, you'll want to return something else from your step. For a langchain Document, you might want to consider a single JSON string, even, which would probably give the snappiest performance at materialization time.

strickvl avatar Feb 23 '25 21:02 strickvl

Closing this due to inactivity.

bcdurak avatar Sep 18 '25 12:09 bcdurak