
Create `evaluate_backtest`

Open hinthornw opened this issue 10 months ago • 0 comments

Feature request

We want to add proper backtesting support, deprecating the beta implementations (`compute_test_metrics`, etc.).

Something like

# Assumed import locations; adjust to wherever these live in the SDK.
import uuid
from typing import Optional, Sequence

from langsmith import Client
from langsmith import evaluation as ls_eval
from langsmith import schemas as ls_schemas
from langsmith.beta import convert_runs_to_test


def backtest_evaluate(
    target: ls_eval.TARGET_T,
    /,
    prod_runs: Sequence[ls_schemas.Run],
    *,
    evaluators: Optional[Sequence[ls_eval.EVALUATOR_T]] = None,
    summary_evaluators: Optional[Sequence[ls_eval.SUMMARY_EVALUATOR_T]] = None,
    metadata: Optional[dict] = None,
    experiment_prefix: Optional[str] = None,
    max_concurrency: Optional[int] = None,
    client: Optional[Client] = None,
    blocking: bool = True,
) -> ls_eval.ExperimentResults:
    """Backtest a target system or function against a sample of production traces.

    Args:
        target (ls_eval.TARGET_T): The target system or function to evaluate.
        prod_runs (Sequence[ls_schemas.Run]): A sequence of production runs to use for backtesting.
        evaluators (Optional[Sequence[ls_eval.EVALUATOR_T]]): A list of evaluators to run
            on each example. Defaults to None.
        summary_evaluators (Optional[Sequence[ls_eval.SUMMARY_EVALUATOR_T]]): A list of summary
            evaluators to run on the entire dataset. Defaults to None.
        metadata (Optional[dict]): Metadata to attach to the experiment.
            Defaults to None.
        experiment_prefix (Optional[str]): A prefix to provide for your experiment name.
            Defaults to None.
        max_concurrency (Optional[int]): The maximum number of concurrent
            evaluations to run. Defaults to None.
        client (Optional[Client]): The LangSmith client to use.
            Defaults to None.
        blocking (bool): Whether to block until the evaluation is complete.
            Defaults to True.

    Returns:
        ls_eval.ExperimentResults: The results of the backtesting evaluation.
    """
    if not prod_runs:
        raise ValueError(
            f"Expected a non-empty sequence of production runs. Received: {prod_runs}"
        )
    client = client or Client()
    test_dataset_name = f"backtest-{uuid.uuid4().hex[:6]}"
    test_project = convert_runs_to_test(
        prod_runs,
        dataset_name=test_dataset_name,
        client=client,
    )
    return ls_eval.evaluate(
        target,
        data=test_dataset_name,
        evaluators=evaluators,
        summary_evaluators=summary_evaluators,
        metadata=metadata,
        experiment_prefix=experiment_prefix,
        max_concurrency=max_concurrency,
        client=client,
        blocking=blocking,
    )
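Conceptually, the `convert_runs_to_test` step turns each sampled production run into a dataset example keyed by the run's inputs. A simplified sketch with plain dicts (the real helper also links examples back to their source runs and creates the dataset in LangSmith; the `runs_to_examples` name and dict shapes here are illustrative, not the SDK's API):

```python
# Simplified sketch of the dataset-conversion step: each production run's
# inputs become a dataset example, with the run's outputs kept as the
# baseline reference (they are prior behavior, not ground truth).
def runs_to_examples(prod_runs: list[dict]) -> list[dict]:
    examples = []
    for run in prod_runs:
        examples.append(
            {
                "inputs": run["inputs"],
                "outputs": run.get("outputs"),
                # Track provenance so results can be compared per-run.
                "metadata": {"source_run_id": run["id"]},
            }
        )
    return examples


runs = [
    {"id": "r1", "inputs": {"question": "hi"}, "outputs": {"answer": "hello"}},
    {"id": "r2", "inputs": {"question": "bye"}, "outputs": None},
]
examples = runs_to_examples(runs)
```

The target is then evaluated against these examples, so its fresh outputs can be scored side by side with what production actually returned.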

Motivation

Backtesting is important, and we want to have strong APIs for it.

hinthornw · Apr 04 '24 03:04