[feature] Mock LLM by recording and replaying responses
Feature request
Components that contain an LLM are hard to unit-test, because their output is not deterministic and they rely on an external API that can fail.
So I propose a way to mock LLM output by simply recording and replaying the responses.
Motivation
It could be helpful in a TDD-based workflow, where we want to refactor without changing behavior.
Your contribution
I've made an example in my personal project, which dumps the output to a JSON file.
The implementation:
class MockOpenAI(OpenAI):
    from_file: Optional[Path] = None
    to_file: Optional[Path] = None
    records: List[LLMResult] = []

    # generate() is overridden to do the recording and replaying
https://github.com/ofey404/WalkingShadows/blob/2cd39f6286193845ba3018bb2bcd42a7ff736fe9/src/backend/services/world/internal/llm/llm.py#L18-L21
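To show the shape of the idea without clicking through, here is a minimal, self-contained sketch of what such an overridden generate() could look like. The JSON layout, the replay-in-order behavior, and the method body are my assumptions, not the exact code from the linked file:

import json
from pathlib import Path
from typing import Any, List, Optional

from langchain.llms import OpenAI
from langchain.schema import LLMResult


class MockOpenAI(OpenAI):
    from_file: Optional[Path] = None
    to_file: Optional[Path] = None
    records: List[LLMResult] = []

    def generate(
        self, prompts: List[str], stop: Optional[List[str]] = None, **kwargs: Any
    ) -> LLMResult:
        if self.from_file is not None:
            # Replay mode: load the recorded results once, then hand them back
            # in order instead of calling the OpenAI API. (Assumed behavior.)
            if not self.records:
                data = json.loads(self.from_file.read_text())
                self.records = [LLMResult(**item) for item in data]
            return self.records.pop(0)
        # Record mode: call the real API and persist every result as JSON.
        result = super().generate(prompts, stop=stop, **kwargs)
        if self.to_file is not None:
            self.records.append(result)
            self.to_file.write_text(json.dumps([r.dict() for r in self.records]))
        return result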
The usage:
MockOpenAI(
    # to_file=Path(__file__).parent / "test_world.json",
    from_file=Path(__file__).parent / "test_world.json",
)
https://github.com/ofey404/WalkingShadows/blob/2cd39f6286193845ba3018bb2bcd42a7ff736fe9/src/backend/services/world/api/world/test/test_world.py#L13C1-L17
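To make the record/replay workflow concrete, a hypothetical pytest-style test could look like the following: run it once with to_file enabled to record real responses, then switch to from_file so later runs replay them offline. The chain, prompt, and assertion are illustrative, not copied from the linked test:

from pathlib import Path

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# MockOpenAI as sketched above.


def test_world():
    llm = MockOpenAI(
        openai_api_key="unused-when-replaying",  # dummy key; use a real one when recording
        # First run: enable to_file to record real responses to disk.
        # to_file=Path(__file__).parent / "test_world.json",
        # Subsequent runs: replay them without touching the API.
        from_file=Path(__file__).parent / "test_world.json",
    )
    chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate.from_template("Describe the world of {name}."),
    )
    result = chain.run(name="Walking Shadows")
    assert result  # deterministic and API-free on every replay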
If this seems appropriate, I'd like to contribute it to LangChain, and I would refine the interface to make it more generic.
Is anyone interested in this? I'd appreciate some support from the maintainers.
@hwchase17 @agola11
I believe the current FakeListLLM combined with a BaseCallbackHandler can be used for this purpose.
Thank you! It's neat.
Hi, any chance you could share an example? 🙏
EDIT: I have figured it out:
from typing import Any, Dict, List, Union

from langchain.callbacks.base import BaseCallbackHandler
from langchain.llms.fake import FakeListLLM


class FakeListLLM(FakeListLLM):
    def model_name(self) -> str:
        return "fake-list-llm"


class CustomCallbackHandler(BaseCallbackHandler):
    """Save chain inputs and LLM prompts."""

    def __init__(self) -> None:
        super().__init__()
        self.input_dict = None
        self.input_prompts = None

    def on_chain_start(
        self, serialized: Dict[str, Any], inputs: Dict[str, Any], **kwargs: Any
    ) -> None:
        """Run when chain starts running."""
        self.input_dict = inputs

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        """Run when LLM starts running."""
        self.input_prompts = prompts

    @property
    def always_verbose(self) -> bool:
        """Whether to call verbose callbacks even if verbose is False."""
        return True

    # ##### Everything below this point is the default (no-op) behavior

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Run on new LLM token. Only available when streaming is enabled."""

    def on_llm_end(self, response, **kwargs: Any) -> None:
        """Run when LLM ends running."""

    def on_llm_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Run when LLM errors."""

    def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> None:
        """Run when chain ends running."""

    def on_chain_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Run when chain errors."""

    def on_tool_start(
        self, serialized: Dict[str, Any], input_str: str, **kwargs: Any
    ) -> None:
        """Run when tool starts running."""

    def on_tool_end(self, output: str, **kwargs: Any) -> None:
        """Run when tool ends running."""

    def on_tool_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> None:
        """Run when tool errors."""

    def on_text(self, text: str, **kwargs: Any) -> None:
        """Run on arbitrary text."""

    def on_agent_action(self, action, **kwargs: Any) -> None:
        """Run on agent action."""

    def on_agent_finish(self, finish, **kwargs: Any) -> None:
        """Run on agent end."""
Then I used it later as follows:
FakeListLLM(
    responses=["<<TESTING>>"] * 128, callbacks=[CustomCallbackHandler()]
)
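If you also want to read back what the handler captured, keep a reference to it and wire it into a chain, roughly like this (the chain and prompt are illustrative):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

handler = CustomCallbackHandler()
llm = FakeListLLM(responses=["<<TESTING>>"] * 128, callbacks=[handler])

chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template("Describe the world of {name}."),
    callbacks=[handler],  # so on_chain_start sees the chain inputs too
)
output = chain.run(name="Walking Shadows")

assert output == "<<TESTING>>"  # deterministic fake response
print(handler.input_dict)       # e.g. {'name': 'Walking Shadows'}
print(handler.input_prompts)    # e.g. ['Describe the world of Walking Shadows.']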