Daft icon indicating copy to clipboard operation
Daft copied to clipboard

fix: Account for unschedulable udf actors

Open colin-ho opened this issue 8 months ago • 2 comments

Changes Made

Raises error in situations where actors for actor UDFs (i.e. concurrency udfs) cannot be scheduled due to resource constraints.

This is implemented via timeout (default 60s) on the readiness check for the actors. Additionally, each swordfish task that needs to run actor UDFs will receive the full list of actor handles, and will check for which actors are ready + local. This is to account for the fact that the UDF actors can be spawned incrementally, and when killed, can spawn on different nodes.

Related Issues

Closes https://github.com/Eventual-Inc/Daft/issues/3934

Checklist

  • [ ] Documented in API Docs (if applicable)
  • [ ] Documented in User Guide (if applicable)
  • [ ] If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • [ ] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

colin-ho avatar Aug 15 '25 23:08 colin-ho

Codecov Report

:x: Patch coverage is 12.03704% with 95 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 73.77%. Comparing base (33884be) to head (ec0412d). :warning: Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...intermediate_ops/distributed_actor_pool_project.rs 0.00% 42 Missing :warning:
...rc/daft-distributed/src/pipeline_node/actor_udf.rs 0.00% 25 Missing :warning:
daft/execution/ray_actor_pool_udf.py 0.00% 24 Missing :warning:
src/common/daft-config/src/python.rs 57.14% 3 Missing :warning:
...local-execution/src/intermediate_ops/cross_join.rs 0.00% 1 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4987      +/-   ##
==========================================
- Coverage   73.81%   73.77%   -0.05%     
==========================================
  Files         957      957              
  Lines      124278   123974     -304     
==========================================
- Hits        91740    91465     -275     
+ Misses      32538    32509      -29     
Files with missing lines Coverage Δ
daft/context.py 81.57% <ø> (ø)
src/common/daft-config/src/lib.rs 72.89% <100.00%> (+0.25%) :arrow_up:
...ft-local-execution/src/intermediate_ops/explode.rs 87.50% <100.00%> (ø)
...aft-local-execution/src/intermediate_ops/filter.rs 95.16% <100.00%> (ø)
...tion/src/intermediate_ops/inner_hash_join_probe.rs 98.65% <100.00%> (ø)
...-execution/src/intermediate_ops/intermediate_op.rs 91.19% <ø> (ø)
...cal-execution/src/intermediate_ops/into_batches.rs 95.55% <100.00%> (ø)
...ft-local-execution/src/intermediate_ops/project.rs 100.00% <100.00%> (ø)
...aft-local-execution/src/intermediate_ops/sample.rs 6.35% <100.00%> (-59.28%) :arrow_down:
...c/daft-local-execution/src/intermediate_ops/udf.rs 93.72% <100.00%> (+1.16%) :arrow_up:
... and 6 more

... and 28 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Aug 18 '25 04:08 codecov[bot]

Overall this makes sense, but I'm not sure the tests indicate the behavior we want.

What i was going for was:

  • if some of the udf actors are schedulable (within a timeout), make progress
  • if none of the udf actors are schedulable (within a timeout), error

lemme refactor a tests a bit to reflect this better

colin-ho avatar Aug 21 '25 02:08 colin-ho