langsmith-sdk
Create `evaluate_backtest`
Feature request
We want to add proper backtesting support, deprecating the beta implementation of `compute_test_metrics`, etc.
Something like:
```python
# Imports assumed for a self-contained sketch.
import uuid
from typing import Optional, Sequence

from langsmith import Client
from langsmith import evaluation as ls_eval
from langsmith import schemas as ls_schemas
from langsmith.beta import convert_runs_to_test


def backtest_evaluate(
    target: ls_eval.TARGET_T,
    /,
    prod_runs: Sequence[ls_schemas.Run],
    *,
    evaluators: Optional[Sequence[ls_eval.EVALUATOR_T]] = None,
    summary_evaluators: Optional[Sequence[ls_eval.SUMMARY_EVALUATOR_T]] = None,
    metadata: Optional[dict] = None,
    experiment_prefix: Optional[str] = None,
    max_concurrency: Optional[int] = None,
    client: Optional[Client] = None,
    blocking: bool = True,
) -> ls_eval.ExperimentResults:
    """Backtest a target system or function against a sample of production traces.

    Args:
        target (ls_eval.TARGET_T): The target system or function to evaluate.
        prod_runs (Sequence[ls_schemas.Run]): A sequence of production runs to
            use for backtesting.
        evaluators (Optional[Sequence[ls_eval.EVALUATOR_T]]): A list of evaluators
            to run on each example. Defaults to None.
        summary_evaluators (Optional[Sequence[ls_eval.SUMMARY_EVALUATOR_T]]): A list
            of summary evaluators to run on the entire dataset. Defaults to None.
        metadata (Optional[dict]): Metadata to attach to the experiment.
            Defaults to None.
        experiment_prefix (Optional[str]): A prefix to provide for your experiment
            name. Defaults to None.
        max_concurrency (Optional[int]): The maximum number of concurrent
            evaluations to run. Defaults to None.
        client (Optional[Client]): The LangSmith client to use. Defaults to None.
        blocking (bool): Whether to block until the evaluation is complete.
            Defaults to True.

    Returns:
        ls_eval.ExperimentResults: The results of the backtesting evaluation.
    """
    if not prod_runs:
        raise ValueError(
            f"Expected a non-empty sequence of production runs. Received: {prod_runs}"
        )
    client = client or Client()
    test_dataset_name = f"backtest-{uuid.uuid4().hex[:6]}"
    # Convert the sampled production runs into a dataset to evaluate against.
    convert_runs_to_test(
        prod_runs,
        dataset_name=test_dataset_name,
        client=client,
    )
    return ls_eval.evaluate(
        target,
        data=test_dataset_name,
        evaluators=evaluators,
        summary_evaluators=summary_evaluators,
        metadata=metadata,
        experiment_prefix=experiment_prefix,
        max_concurrency=max_concurrency,
        client=client,
        blocking=blocking,
    )
```
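To make the intended behavior concrete, here is a minimal, framework-free sketch of the core backtesting loop the function above wraps: replay recorded production inputs through a candidate target and score the new outputs against the recorded ones. The `Run` dataclass, `backtest` helper, and `same_length` evaluator here are illustrative stand-ins, not the SDK's types.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Run:
    """Illustrative stand-in for a recorded production trace."""
    inputs: dict
    outputs: dict


def backtest(
    target: Callable[[dict], dict],
    prod_runs: Sequence[Run],
    evaluators: Sequence[Callable[[Run, dict], float]],
) -> list[dict]:
    """Replay recorded production inputs through a candidate target and score it."""
    if not prod_runs:
        raise ValueError("Expected a non-empty sequence of production runs.")
    results = []
    for run in prod_runs:
        # Re-run the candidate system on the exact inputs seen in production.
        candidate_outputs = target(run.inputs)
        # Each evaluator compares the candidate's outputs to the recorded run.
        scores = {
            evaluator.__name__: evaluator(run, candidate_outputs)
            for evaluator in evaluators
        }
        results.append(
            {"inputs": run.inputs, "outputs": candidate_outputs, "scores": scores}
        )
    return results


def same_length(run: Run, outputs: dict) -> float:
    """Toy evaluator: 1.0 if the candidate answer matches the production answer's length."""
    return float(len(outputs["answer"]) == len(run.outputs["answer"]))


runs = [Run(inputs={"question": "2+2?"}, outputs={"answer": "4"})]
results = backtest(lambda inp: {"answer": "4"}, runs, [same_length])
```

The real implementation differs in that it materializes the production runs as a LangSmith dataset first, so the backtest is recorded as an experiment that can be compared against later runs.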
Motivation
Backtesting is important - we want to have strong APIs for this.