Daft icon indicating copy to clipboard operation
Daft copied to clipboard

feat: add forward and backward strategies for fill_null

Open nsalerni opened this issue 5 months ago • 4 comments

Changes Made

Add forward and backward strategies for fill_null. Kept backwards compatibility for the existing fill_null by either passing no strategy or strategy == 'value'.

Related Issues

Closes #4683

Checklist

  • [x] Documented in API Docs (if applicable)
  • [x] Documented in User Guide (if applicable)
  • [x] If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • [x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

nsalerni avatar Jul 15 '25 18:07 nsalerni

Two failing checks are failing on main prior to this PR.

nsalerni avatar Jul 15 '25 20:07 nsalerni

Codecov Report

:x: Patch coverage is 78.02198% with 60 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 78.76%. Comparing base (87cab23) to head (c145ad2). :warning: Report is 512 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-core/src/join.rs 50.00% 11 Missing :warning:
src/daft-dsl/src/expr/display.rs 0.00% 11 Missing :warning:
src/daft-logical-plan/src/partitioning.rs 0.00% 10 Missing :warning:
src/daft-logical-plan/src/ops/project.rs 0.00% 9 Missing :warning:
src/daft-core/src/series/ops/null.rs 92.52% 8 Missing :warning:
daft/series.py 85.71% 3 Missing :warning:
src/daft-dsl/src/expr/mod.rs 92.50% 3 Missing :warning:
src/daft-dsl/src/visitor.rs 0.00% 3 Missing :warning:
daft/expressions/expressions.py 94.11% 1 Missing :warning:
src/daft-recordbatch/src/lib.rs 88.88% 1 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4776      +/-   ##
==========================================
- Coverage   78.90%   78.76%   -0.15%     
==========================================
  Files         886      886              
  Lines      123324   124296     +972     
==========================================
+ Hits        97313    97901     +588     
- Misses      26011    26395     +384     
Files with missing lines Coverage Δ
src/daft-core/src/lib.rs 100.00% <100.00%> (ø)
src/daft-core/src/python/series.rs 90.47% <100.00%> (+0.37%) :arrow_up:
src/daft-dsl/src/python.rs 79.38% <100.00%> (+0.45%) :arrow_up:
daft/expressions/expressions.py 96.32% <94.11%> (-0.06%) :arrow_down:
src/daft-recordbatch/src/lib.rs 85.76% <88.88%> (-0.02%) :arrow_down:
daft/series.py 91.86% <85.71%> (-0.26%) :arrow_down:
src/daft-dsl/src/expr/mod.rs 79.53% <92.50%> (+0.58%) :arrow_up:
src/daft-dsl/src/visitor.rs 56.22% <0.00%> (-0.49%) :arrow_down:
src/daft-core/src/series/ops/null.rs 92.37% <92.52%> (-7.63%) :arrow_down:
src/daft-logical-plan/src/ops/project.rs 58.17% <0.00%> (-0.11%) :arrow_down:
... and 3 more

... and 18 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Jul 15 '25 21:07 codecov[bot]

Hi @nsalerni, thank you for contributing this! I took a quick look at this PR, the main issue is that this implementation is not fully correct, as it only fills nulls within a RecordBatch, and does not carry them over.

For example, if we set the RecordBatch sizes to 1 using default_morsel_size, you will see this behavior:

>>> import daft
>>> daft.context.set_execution_config(default_morsel_size=1)
DaftContext(_ctx=<builtins.PyDaftContext object at 0x102b7a5f0>)
>>> df = daft.from_pydict({"a": [1, None]})
>>> df.select(df["a"].fill_null(strategy="forward")).show()
╭───────╮
│ a     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ None  │
╰───────╯

(Showing first 2 of 2 rows)

Fixing this is not trivial and requires building up mechanisms in our physical pipelines. This is actually something we also want to build for our non-partitioned window functions as well, and would be a significant enhancement to Daft! So if you are up to it, let's talk more. I am happy to help guide you

I have some other comments too but they are minor and I will leave them for a more thorough review.

@kevinzwang happy to take a first go, let me ping you in the daft slack ! 😄

nsalerni avatar Jul 16 '25 00:07 nsalerni

Hi @nsalerni 👋 Thanks for this contribution! Just checking in here -- how is it going addressing the review feedback?

malcolmgreaves avatar Oct 21 '25 21:10 malcolmgreaves

Hi @nsalerni - Checking to see if this is something you are still able to push through.

madvart avatar Nov 18 '25 20:11 madvart