Daft icon indicating copy to clipboard operation
Daft copied to clipboard

feat: Allow dashboard to show query canceled/failed/dead information when query exited abnormally

Open VOID001 opened this issue 5 months ago • 1 comments

Changes Made

This is a sub-pr for the issue #5518

  • [x] Add QueryResult and PyQueryResult to show detailed state & error of a query
  • [ ] Modify Subscriber interface to accept query end result
  • [ ] Daft-Dashboard add server support for /query/<id>/end (with frontend modification)

Related Issues

#5518

Checklist

  • [ ] Documented in API Docs (if applicable)
  • [ ] Documented in User Guide (if applicable)
  • [ ] If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • [ ] Documentation builds and is formatted properly

VOID001 avatar Nov 14 '25 03:11 VOID001

Greptile Overview

Greptile Summary

This PR implements comprehensive query state tracking for canceled/failed/dead queries in the Daft Dashboard. The implementation adds QueryResult with end state and error message throughout the stack - from Python runner through Rust context to dashboard backend and frontend UI. The changes properly notify subscribers when queries end abnormally (keyboard interrupt, UDF exceptions, or general exceptions) and display appropriate status badges with error details in the dashboard.

Key Changes:

  • Added QueryEndState enum (Finished/Canceled/Failed/Dead) and QueryResult struct in Rust
  • Updated Subscriber trait to accept QueryResult in on_query_end callback
  • Modified native_runner.py to catch exceptions and notify with appropriate end states
  • Enhanced dashboard backend to handle Canceled/Failed states with error messages
  • Added UI components for Canceled/Failed states with tooltips showing error details
  • Comprehensive tests validate all query end states

Issues Found:

  • todo!() panic in engine.rs:325 for Dead state will crash dashboard if Dead state is received
  • Missing duration_sec field in TypeScript FailedStatus and CanceledStatus types causes type mismatch with backend data

Confidence Score: 3/5

  • Safe to merge after fixing critical Dead state panic and TypeScript type mismatches
  • The implementation is mostly solid with proper state propagation through the stack and good test coverage. However, two logical issues prevent this from being merge-ready: (1) the todo!() in engine.rs will panic if Dead state is received, potentially crashing the dashboard, and (2) TypeScript types are missing duration_sec fields that the backend provides, which could cause runtime type errors in the frontend
  • Pay close attention to src/daft-dashboard/src/engine.rs (Dead state panic) and src/daft-dashboard/frontend/src/hooks/use-queries.ts (missing duration fields)

Important Files Changed

File Analysis

Filename Score Overview
daft/runners/native_runner.py 4/5 Added comprehensive exception handling with try-except-else to notify subscribers of query end state (Finished/Canceled/Failed)
src/daft-context/src/python.rs 5/5 Added PyQueryResult type and updated notify_query_end to accept query result with end state and error message
src/daft-context/src/subscribers/mod.rs 5/5 Added QueryResult struct with end state and error message; updated Subscriber trait on_query_end signature
src/daft-dashboard/src/engine.rs 3/5 Added support for Canceled/Failed states in query_end endpoint; Dead state implementation is marked as todo!()
src/daft-dashboard/src/state.rs 5/5 Added Canceled and Failed variants to QueryState and QueryStatus enums with error messages and duration tracking
tests/test_subscribers.py 5/5 Added comprehensive tests for Finished, Failed, and Canceled query states with MockSubscriber validation

Sequence Diagram

sequenceDiagram
    participant User
    participant NativeRunner
    participant DaftContext
    participant Subscriber
    participant Dashboard
    
    User->>NativeRunner: run_iter(builder)
    NativeRunner->>DaftContext: notify_query_start(query_id, metadata)
    DaftContext->>Subscriber: on_query_start(query_id, metadata)
    Subscriber->>Dashboard: POST /query/{id}/start
    
    NativeRunner->>DaftContext: notify_optimization_start(query_id)
    DaftContext->>Subscriber: on_optimization_start(query_id)
    Subscriber->>Dashboard: POST /query/{id}/plan_start
    
    NativeRunner->>NativeRunner: builder.optimize()
    
    NativeRunner->>DaftContext: notify_optimization_end(query_id, plan)
    DaftContext->>Subscriber: on_optimization_end(query_id, plan)
    Subscriber->>Dashboard: POST /query/{id}/plan_end
    
    NativeRunner->>NativeRunner: execute plan
    
    alt Success Path
        loop For each result
            NativeRunner->>DaftContext: notify_result_out(query_id, result)
            DaftContext->>Subscriber: on_result_out(query_id, result)
            NativeRunner-->>User: yield result
        end
        NativeRunner->>DaftContext: notify_query_end(query_id, Finished)
        DaftContext->>Subscriber: on_query_end(query_id, Finished)
        Subscriber->>Dashboard: POST /query/{id}/end (state=Finished)
    else KeyboardInterrupt
        NativeRunner->>DaftContext: notify_query_end(query_id, Canceled)
        DaftContext->>Subscriber: on_query_end(query_id, Canceled)
        Subscriber->>Dashboard: POST /query/{id}/end (state=Canceled)
        NativeRunner-->>User: raise KeyboardInterrupt
    else UDFException
        NativeRunner->>DaftContext: notify_query_end(query_id, Failed)
        DaftContext->>Subscriber: on_query_end(query_id, Failed)
        Subscriber->>Dashboard: POST /query/{id}/end (state=Failed)
        NativeRunner-->>User: raise UDFException
    else Other Exception
        NativeRunner->>DaftContext: notify_query_end(query_id, Failed)
        DaftContext->>Subscriber: on_query_end(query_id, Failed)
        Subscriber->>Dashboard: POST /query/{id}/end (state=Failed)
        NativeRunner-->>User: raise Exception
    end
    
    Dashboard->>Dashboard: Update QueryState (Finished/Canceled/Failed)
    Dashboard->>User: Display final status in UI

greptile-apps[bot] avatar Nov 14 '25 03:11 greptile-apps[bot]

@greptile

VOID001 avatar Nov 16 '25 15:11 VOID001

Hi @srilman, this PR is ready for review, I also mentioned the impl in slack channel. You may take a look when you have time

I will finish the "liveness check" part in next PR

VOID001 avatar Nov 17 '25 09:11 VOID001

Also, I wonder how should we add unit-test or tests for this kind of PR, it is an end-to-end with dashboard-server case. I don't see currently have any good framework inside repo to perform it

VOID001 avatar Nov 17 '25 09:11 VOID001

Codecov Report

:x: Patch coverage is 56.31068% with 45 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 70.37%. Comparing base (01b06e3) to head (6a83a8a). :warning: Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-dashboard/src/engine.rs 0.00% 21 Missing :warning:
src/daft-dashboard/src/state.rs 0.00% 11 Missing :warning:
src/daft-context/src/python.rs 80.00% 7 Missing :warning:
src/daft-context/src/subscribers/dashboard.rs 0.00% 3 Missing :warning:
src/common/metrics/src/lib.rs 50.00% 1 Missing :warning:
src/daft-context/src/subscribers/debug.rs 0.00% 1 Missing :warning:
src/daft-context/src/subscribers/python.rs 83.33% 1 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5576      +/-   ##
==========================================
- Coverage   70.38%   70.37%   -0.02%     
==========================================
  Files        1021     1021              
  Lines      131276   131360      +84     
==========================================
+ Hits        92403    92448      +45     
- Misses      38873    38912      +39     
Files with missing lines Coverage Δ
daft/context.py 87.64% <100.00%> (ø)
daft/runners/native_runner.py 80.00% <100.00%> (+4.63%) :arrow_up:
daft/subscribers/abc.py 72.22% <100.00%> (ø)
src/daft-context/src/lib.rs 95.31% <100.00%> (+0.03%) :arrow_up:
src/daft-context/src/subscribers/mod.rs 44.44% <ø> (ø)
src/common/metrics/src/lib.rs 39.39% <50.00%> (+0.68%) :arrow_up:
src/daft-context/src/subscribers/debug.rs 0.00% <0.00%> (ø)
src/daft-context/src/subscribers/python.rs 92.10% <83.33%> (-0.69%) :arrow_down:
src/daft-context/src/subscribers/dashboard.rs 2.19% <0.00%> (-0.03%) :arrow_down:
src/daft-context/src/python.rs 82.19% <80.00%> (-1.00%) :arrow_down:
... and 2 more

... and 8 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Nov 17 '25 10:11 codecov[bot]

Hi @colin-ho , could you please help take a look at this PR? Thanks a lot

VOID001 avatar Nov 19 '25 03:11 VOID001

Also, I wonder how should we add unit-test or tests for this kind of PR, it is an end-to-end with dashboard-server case. I don't see currently have any good framework inside repo to perform it

We do have tests in tests/test_subscribers.py you can add there to test that failed / cancelled events are captured by the MockSubscriber. However we do not have end to end tests with the dashboard yet.

colin-ho avatar Nov 20 '25 05:11 colin-ho

Hi @colin-ho, I've added tests and fix the comments. Could you please take another look?

Re: the dead state TODO mark. I will implement them in the next PR in order to fully close the issue #5518. Or do you want me to remove the "Dead" state in this PR and add it back in the next PR?

VOID001 avatar Nov 22 '25 10:11 VOID001

@greptile

VOID001 avatar Nov 22 '25 10:11 VOID001

Hi @colin-ho, I've added tests and fix the comments. Could you please take another look?

Re: the dead state TODO mark. I will implement them in the next PR in order to fully close the issue #5518. Or do you want me to remove the "Dead" state in this PR and add it back in the next PR?

No worries, feel free to do it in the next PR

colin-ho avatar Nov 24 '25 19:11 colin-ho