
feat: improve eval summary output with compact layout and better UX

Open mldangelo opened this issue 2 months ago • 2 comments

Summary

Redesigns the eval summary output to be more compact, readable, and user-friendly while preserving all information. Reduces output from ~26 lines to ~10 lines and fixes a critical bug where grading tokens weren't displayed in certain scenarios.

Before vs After

Before (26 lines)

✓ Eval complete
ID: eval-xyz-123

» Run promptfoo view to use the local web viewer
» Run promptfoo share to create a shareable URL
» This project needs your feedback. What's one thing we can improve? https://promptfoo.dev/feedback

Total Tokens: 2,927
  Eval: 929 (929 prompt, 0 completion, 929 cached)
  Grading: 1,998 (1,636 prompt, 362 completion)

Provider Breakdown:
  openai:gpt-5: 496 (58 prompt, 438 completion, 384 reasoning)
  openai:gpt-5-mini: 433 (58 prompt, 375 completion, 320 reasoning)
Pass Rate: 50.00%
Results: 4 passed, 4 failed, 0 errors
Duration: 9s (concurrency: 4)

After (10 lines)

✓ Eval complete (ID: eval-xyz-123)

» View results: promptfoo view
» Share results: promptfoo share
» Feedback: https://promptfoo.dev/feedback

Total Tokens: 2,927
  Eval: 929 (cached)
  Grading: 1,998 (1,636 prompt, 362 completion)

Providers:
  openai:gpt-5: 496 (0 requests; cached)
  openai:gpt-5-mini: 433 (0 requests; cached)

Results: ✓ 4 passed, ✗ 4 failed, 0 errors (50%)
Duration: 9s (concurrency: 4)

After (with --no-share flag)

✓ Eval complete (ID: eval-xyz-123)

» View results: promptfoo view
» Feedback: https://promptfoo.dev/feedback

Total Tokens: 2,927
  Eval: 929 (cached)
  Grading: 1,998 (1,636 prompt, 362 completion)

Providers:
  openai:gpt-5: 496 (0 requests; cached)
  openai:gpt-5-mini: 433 (0 requests; cached)

Results: ✓ 4 passed, ✗ 4 failed, 0 errors (50%)
Duration: 9s (concurrency: 4)

(Note: the share guidance line is hidden when the user explicitly disables sharing with --no-share.)

Key Changes

Layout Improvements

  • Combined completion + ID: ✓ Eval complete (ID: eval-xyz-123) on single line
  • Visual spacing: Added blank line between provider breakdown and results for better separation
  • Compact guidance: Simplified and action-oriented recommendations

UX Improvements

  • Milder green: Changed from greenBright.bold() to green.bold() for better readability
  • Stronger view recommendation: "View results: promptfoo view" instead of "Run promptfoo view to use the local web viewer"
  • Conditional share guidance: Respects the --no-share flag; the share suggestion is omitted when the user has explicitly disabled sharing
  • Simplified feedback: "Feedback: https://promptfoo.dev/feedback" instead of long marketing message
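As a sketch, the conditional guidance could look like the following (the function name and flag handling are illustrative, not promptfoo's actual internals):

```typescript
// Illustrative sketch (not promptfoo's real code) of how the guidance
// lines respect --no-share: the share suggestion is only emitted when
// the user has not explicitly opted out of sharing.
function buildGuidanceLines(wantsToShare: boolean): string[] {
  const lines = ['» View results: promptfoo view'];
  if (wantsToShare) {
    lines.push('» Share results: promptfoo share');
  }
  lines.push('» Feedback: https://promptfoo.dev/feedback');
  return lines;
}
```

With `--no-share`, `wantsToShare` would be false and the share line is simply never built, rather than being filtered out afterwards.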

Token Display Improvements

  • Smart caching display: Shows 929 (cached) instead of 929 (929 cached) when 100% cached
  • Always show request counts: Displays "0 requests" to explicitly indicate 100% cache hit
  • Only show eval breakdown when relevant: Hides "Eval:" line when there are no eval tokens
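The token-display rules above might be implemented along these lines (the `TokenUsage` shape and `formatTokenLine` helper are assumptions for illustration, not promptfoo's actual API):

```typescript
// Illustrative sketch of the token-line formatting rules.
interface TokenUsage {
  total: number;
  prompt: number;
  completion: number;
  cached: number;
}

function formatTokenLine(label: string, usage: TokenUsage): string | null {
  // Hide the line entirely when there are no tokens to report
  // (e.g. the "Eval:" line under --model-outputs with llm-rubric).
  if (usage.total === 0) {
    return null;
  }
  // Collapse "929 (929 cached)" to "929 (cached)" on a 100% cache hit.
  if (usage.cached === usage.total) {
    return `${label}: ${usage.total.toLocaleString('en-US')} (cached)`;
  }
  const prompt = usage.prompt.toLocaleString('en-US');
  const completion = usage.completion.toLocaleString('en-US');
  return `${label}: ${usage.total.toLocaleString('en-US')} (${prompt} prompt, ${completion} completion)`;
}
```

Returning `null` for empty usage lets the caller skip the line, which is what keeps the "Eval:" row out of grading-only runs.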

Bug Fixes

  • CRITICAL: Fixed bug where grading tokens weren't displayed when eval had no provider tokens (e.g., when using --model-outputs with llm-rubric assertions)
  • Only show "Eval:" breakdown when there are actual eval tokens to display

Other Improvements

  • Smart pass rate precision: 100% for exact values, 95.67% for decimals
  • Consistent ✓/✗ symbols for accessibility
  • Better visual hierarchy throughout
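The pass-rate precision rule could be sketched as follows (`formatPassRate` is a hypothetical helper, not the actual implementation):

```typescript
// Illustrative sketch: integer pass rates drop the decimals ("100%"),
// fractional rates keep two places ("95.67%").
function formatPassRate(passed: number, total: number): string {
  if (total === 0) {
    return '0%'; // avoid NaN on empty result sets
  }
  const rate = (passed / total) * 100;
  return Number.isInteger(rate) ? `${rate}%` : `${rate.toFixed(2)}%`;
}
```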

Testing

  • ✅ All 7,538 tests pass across 421 test suites
  • ✅ Tested with various flags: --no-share, --no-write, --no-cache
  • ✅ Tested edge cases: zero tokens, grading-only tokens, perfect/zero pass rates
  • ✅ Tested with single and multiple providers

Breaking Changes

None. All existing functionality is preserved with improved presentation.

mldangelo avatar Oct 18 '25 06:10 mldangelo

⏩ No test execution environment matched (25bb939dc2c360bd0816c434c854fc0be074af54)



use-tusk[bot] avatar Oct 18 '25 06:10 use-tusk[bot]

📝 Walkthrough

The change reworks the finalization and reporting flow in doEval: completion messaging is consolidated into a single line, with conditional logic replacing the previous separate branches for the various scenarios. It introduces combined token-usage summaries for eval and grading tokens, adds a provider breakdown when multiple providers are used, implements new pass-rate formatting with coloring, and reorganizes the display sequence with improved spacing and section headers. Existing discrete log lines are replaced with concise, bolded summaries and standardized guidance text, and the share guidance is now conditional on the user's explicit preference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Rationale: The change is confined to a single file but touches multiple interconnected sections: completion-messaging consolidation, token-usage calculations, provider breakdown logic, pass-rate formatting, and display sequencing. While the changes follow a consistent pattern (refactoring output generation), they span several functional areas, each requiring separate reasoning to verify correctness and interactions. The mix of conditional-logic refinements and formatting adjustments adds moderate complexity, but the logic is not densely spread across many files.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check ✅ Passed — The pull request title "feat: improve eval summary output with compact layout and better UX" is fully aligned with the changeset: the core change is redesigning the eval summary output to be more compact and readable while preserving functionality. The title is concise (67 characters), specific, and follows conventional commit format; a developer scanning project history would immediately understand it concerns the formatting and UX of eval summaries.
  • Description Check ✅ Passed — The description comprehensively covers the changeset, detailing the redesigned eval summary output, async background sharing feature, layout improvements, UX enhancements, token display improvements, and bug fixes.

coderabbitai[bot] avatar Oct 18 '25 06:10 coderabbitai[bot]

Does this require another review, @mldangelo?

will-holley avatar Dec 01 '25 15:12 will-holley