
How are derived metrics defined?

Open nfrancis-esure opened this issue 10 months ago • 4 comments

Hi, I want to create a derived metric using a mathematical expression, but it's not clear to me how to average the scores between tests. What language is used for the mathematical expressions, and can things like the number of tests be used, e.g. sum(tests) / len(tests)?

nfrancis-esure avatar Feb 19 '25 12:02 nfrancis-esure

Promptfoo uses mathjs to evaluate expressions in derived metrics. You can use standard math operators for operations like averaging. For details, please see the mathjs syntax documentation.

For example, to calculate an average score you might write:

derivedMetrics:
  - name: 'AverageScore'
    value: '(metric1 + metric2 + metric3) / 3'
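Beyond basic operators, mathjs also provides built-in functions such as min, max, and round, so (as a rough sketch reusing the metric1–metric3 names from the example above; the 'WorstCase' name is just illustrative) you could write something like:

derivedMetrics:
  - name: 'WorstCase'  # illustrative name, not from this thread
    value: 'min(metric1, metric2, metric3)'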

I just opened https://github.com/promptfoo/promptfoo/pull/3157 to try to explain them better! Would love your thoughts on it. Let us know if you have any other questions.

mldangelo avatar Feb 20 '25 05:02 mldangelo

Thanks, yes that explains it much better! Two related questions:

  • It doesn't seem possible to add an assertion to derived metrics so that they can pass/fail. Is that correct?
  • I would like to create a derived metric that calculates the average latency across multiple input vars. However, it seems like the derived metric can only access the pass/fail value rather than the latency in milliseconds.

nfrancis-esure avatar Feb 20 '25 09:02 nfrancis-esure

Also, the JavaScript example was great, and I think it could be used for the average latency use case, but it would help if the docs showed how all the metrics, such as accuracy, speed, and difficulty, were defined in the YAML.

nfrancis-esure avatar Feb 20 '25 14:02 nfrancis-esure

@mldangelo tagging in case you didn't get a notification for this. Apologies if you did.

nfrancis-esure avatar Feb 21 '25 09:02 nfrancis-esure

Hi @nfrancis-esure, thanks for the question!

Derived metrics use mathjs syntax for mathematical expressions. The documentation has been expanded since this issue was opened and now covers this in detail:

Derived Metrics Documentation

Quick Answer

For your specific question about averaging: derived metrics operate per prompt, not across all tests. Each named metric accumulates pass rates automatically. For example:

defaultTest:
  assert:
    - type: contains
      value: "expected"
      metric: accuracy  # This creates a named metric

derivedMetrics:
  - name: weighted_score
    value: 'accuracy * 0.6 + relevance * 0.4'  # mathjs syntax

The accuracy metric will automatically show as a percentage (e.g., "85.00% (17/20)") in the results.

For more complex logic, you can use JavaScript functions:

derivedMetrics:
  - name: custom_avg
    value: |
      function(namedScores, evalStep) {
        const { metric1 = 0, metric2 = 0 } = namedScores;
        return (metric1 + metric2) / 2;
      }
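For the latency idea from the earlier comments, a JavaScript function could in principle read timing data, though this is only a sketch: the latencyMs field on evalStep below is an assumption and isn't confirmed anywhere in this thread, so check the promptfoo types for the actual shape.

derivedMetrics:
  - name: latency_seconds
    value: |
      function(namedScores, evalStep) {
        // Hypothetical: assumes evalStep exposes latency in milliseconds as latencyMs.
        const latencyMs = (evalStep && evalStep.latencyMs) || 0;
        return latencyMs / 1000;
      }

Whether that aggregates across test cases the way you want is worth verifying against the derived metrics docs linked above.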


Closing as this is now documented. Feel free to reopen if you have follow-up questions!

mldangelo avatar Dec 07 '25 23:12 mldangelo