
API: make decision on empty sums

Open MarcoGorelli opened this issue 7 months ago • 3 comments

xref https://github.com/narwhals-dev/narwhals/discussions/1657#discussioncomment-13003096

I think this is the last big item we need to resolve before we can think about narwhals.stable.v2

MarcoGorelli avatar May 01 '25 11:05 MarcoGorelli

My first thought would be unify on what polars does.

Some questions I might ask though would be:

  • What do each of the backends do natively?
  • Is polars the ugly duckling?
  • Have there been issues raised in polars regarding its behavior?
  • Do you have some examples where you see the current situation causing issues?

dangotbanned avatar May 01 '25 11:05 dangotbanned

What do each of the backends do natively?

  • pandas, polars, pyarrow and dask return 0
  • duckdb and pyspark (and sqlframe) return null

🥲
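
A minimal sketch of that split (assuming pandas, polars and duckdb are installed):

import duckdb
import pandas as pd
import polars as pl

# pandas and polars: empty sum is 0
print(pd.Series([], dtype="int64").sum())        # 0
print(pl.Series("a", [], dtype=pl.Int64).sum())  # 0

# duckdb: empty sum is NULL
print(duckdb.sql("SELECT SUM(x) AS s FROM (SELECT 1 AS x) t WHERE x > 1"))  # s: NULL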

FBruzzesi avatar May 03 '25 16:05 FBruzzesi

I tried out unifying on Polars' behaviour, and for duckdb we get a bunch of mysterious

duckdb.duckdb.ParserException: Parser Error: syntax error at or near "over"

TBH I'm not overly bothered about this difference

I'm tempted to just document it, alongside the other differences that we list (e.g. boolean columns, NaN vs null)

MarcoGorelli avatar May 20 '25 10:05 MarcoGorelli

pandas, polars, pyarrow and dask return 0

@FBruzzesi are you sure about this?

In [9]: pa.array([], type='int64').sum()
Out[9]: <pyarrow.Int64Scalar: None>

In [10]: pc.sum(pa.array([], type='int64'))
Out[10]: <pyarrow.Int64Scalar: None>

So it really seems to be a case of "pandas-like and Polars" vs "the rest"

As much as I usually prefer Polars' decisions, I really think I prefer "the rest" here 😳


empty sum being 0 does lead to what i find to be surprising non-identities:

In [21]: df
Out[21]:
shape: (0, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
└─────┘

In [22]: df.select(pl.col('a').sum() / pl.col('a').count())
Out[22]:
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ NaN │
└─────┘

In [23]: df.select(pl.col('a').mean())
Out[23]:
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ null │
└──────┘

From a practical point of view, empty sum being 0, in my head, means "treat a broken sensor as a zero measurement", which could lead to incorrect conclusions down the line (rather than being alerted that something was wrong, e.g. that a sensor was failing to report data)

MarcoGorelli avatar Jun 04 '25 13:06 MarcoGorelli

@MarcoGorelli I was checking using a different approach:

import pyarrow as pa
import narwhals as nw

data = {"a": [1, 2], "b": [3,4]}

(
    nw.from_native(pa.table(data))
    .filter(nw.col("a")>3)
    .select(nw.all().sum())
)
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|  pyarrow.Table   |
|  a: int64        |
|  b: int64        |
|  ----            |
|  a: [[0]]        |
|  b: [[0]]        |
└──────────────────┘

so now I am even more confused 🤷🏼‍♂️

FBruzzesi avatar Jun 04 '25 14:06 FBruzzesi

I understand where you're coming from @MarcoGorelli and it does seem reasonable to want to standardise on the null propagation.

Also - for the questions I asked in (https://github.com/narwhals-dev/narwhals/issues/2464#issuecomment-2844685314) - I feel like we've got a good idea on all of them except:

Some questions I might ask though would be:

* Have there been issues raised in `polars` regarding its behavior?

I think the bar for deviating from both polars and pandas behavior should be much higher than either:

  • Standardising on polars
  • Letting each backend do their thing

Example

Let's say I maintain a package that only supports pandas, and someone suggests narwhals as a way to support polars.

Everything seems to be going smoothly in the transition, but now my tests are failing because my assumptions about how pandas handles this behavior are no longer true. I assume this must be a polars thing, so I go to debug by replacing all my nw.col with pl.col - but I can't reproduce anything.

I can imagine this being a point of confusion - particularly if you haven't been exposed to the other behavior and you didn't plan on supporting anything beyond pandas and polars.

Side note

Can we even enforce this for pl.LazyFrame? I feel like we'd need to be materializing everywhere to check .is_empty() 🤔

dangotbanned avatar Jun 04 '25 15:06 dangotbanned

so now I am even more confused 🤷🏼‍♂️

@FBruzzesi for pyarrow, this is how we'd control the behavior for sum:

import pyarrow as pa
from pyarrow import compute as pc

data = {"a": [1, 2], "b": [3, 4]}
frame = pa.table(data)
filtered = frame.filter(pc.field("a") > 3)

sum_default = pc.sum(filtered.column(0))              # min_count defaults to 1 -> None for empty input
sum_no_min = pc.sum(filtered.column(0), min_count=0)  # min_count=0 -> 0 for empty input
>>> sum_default, sum_no_min
(<pyarrow.Int64Scalar: None>, <pyarrow.Int64Scalar: 0>)

dangotbanned avatar Jun 04 '25 15:06 dangotbanned

I can't find the history as to why SQL opted to return null for an empty summation, but I would guess that the primary reason is that one can disambiguate a summation over an empty input from a summation that genuinely reaches 0. Some of the most common operations in SQL are aggregations and filtering, and this is an easy way for aggregation to take advantage of tri-state logic to convey information about the inputs to the aggregation. IMO this is a motivating reason to have null be the return value.
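
A tiny illustration of that point (assuming duckdb is installed): NULL distinguishes "no rows to sum" from "values that genuinely sum to 0":

import duckdb

# values that genuinely sum to 0 -> 0
print(duckdb.sql("SELECT SUM(x) AS s FROM (VALUES (1), (-1)) t(x)"))
# no rows after filtering -> NULL, signalling there was nothing to sum
print(duckdb.sql("SELECT SUM(x) AS s FROM (VALUES (1), (-1)) t(x) WHERE x > 1"))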

However, mathematically speaking:

Summation of an empty sequence (a sequence with no elements), by convention, results in 0.

  • https://en.wikipedia.org/wiki/Summation#:~:text=Summation%20of%20an%20empty%20sequence%20

This is likely why both base Python and numpy return 0 (there may be other reasons, I haven't done a historical dive here). So clearly we have both motivating reasons and historical precedent that go either way, and this is probably one of the largest design differences in these tools that we've had to make a decision on.
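
For reference, base Python and NumPy do indeed follow the empty-sum-is-zero convention:

import numpy as np

print(sum([]))     # 0
print(np.sum([]))  # 0.0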

My input is that Narwhals is neither a tool for math nor a tool to solely do what Polars does, but a tool for designing an intuitive and consistent DataFrame interface for users. In that mindset I think we should support both behaviours via a min_count argument, like pandas and PyArrow do. It is technical overhead, but this generalized approach lets us make these backends cross-compatible, which I believe is the ultimate goal: not just being able to write the same syntax, but also receiving the same results.

For these major backends we can define the behaviors like so (please correct me if any of these appear wrong, there's only so many tabular interfaces I can hold in my head at once):

| Library  | Empty summation result | min_count arg | Implementation                                               |
|----------|------------------------|---------------|--------------------------------------------------------------|
| pandas   | 0                      | True          | sum(min_count=...)                                           |
| polars   | 0                      | False         | when(count(...) > min_count).then(pl.sum()).otherwise(None)  |
| pyarrow  | 0                      | True          | sum(min_count=...)                                           |
| dask     | 0                      | True          | sum(min_count=...)                                           |
| duckdb   | null                   | False         | case when count(...) > min_count then sum() else null        |
| pyspark  | null                   | False         | when(count(...) > min_count, sum()).otherwise(None)          |
| sqlframe | null                   | False         | same as pyspark (I believe?)                                 |

For those already concerned with performance, we have control over how the front end code is translated before it is evaluated by each backend. So it is very easy for us to add special cases for the natural result of each backend (e.g. nw.col(...).sum(min_count=0) would translate to Polars pl.col(...).sum() and nw.col(...).sum(min_count=1) would just translate to sum(...) for duckdb).
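
As a rough sketch of what that per-backend translation could look like (hypothetical helper names, not actual narwhals internals; the min_count semantics here mirror pandas, where fewer than min_count valid values yields a null result):

import polars as pl


def polars_sum(col: str, min_count: int) -> pl.Expr:
    # Hypothetical: Polars already returns 0 for an empty sum, so min_count=0
    # maps straight to .sum(); otherwise guard the sum with when/then on count().
    if min_count == 0:
        return pl.col(col).sum()
    return (
        pl.when(pl.col(col).count() >= min_count)
        .then(pl.col(col).sum())
        .otherwise(None)
    )


def duckdb_sum_sql(col: str, min_count: int) -> str:
    # Hypothetical: DuckDB already returns NULL for an empty sum, so min_count=1
    # maps straight to sum(); min_count=0 needs a coalesce; higher counts need a case.
    if min_count == 1:
        return f'sum("{col}")'
    if min_count == 0:
        return f'coalesce(sum("{col}"), 0)'
    return f'case when count("{col}") >= {min_count} then sum("{col}") else null end'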

That said we would then need to specify an appropriate default for min_count and even consider making this an option to all numeric aggregation operations.

camriddell avatar Jun 04 '25 15:06 camriddell

Very impressive write-up @camriddell!

Adding a min_count parameter definitely seems like the most pragmatic option for us. I was going to suggest raising a feature request in polars, and I'd be fully onboard at that stage (regardless of whether it landed).

However, a quick search revealed the current behavior is intentional and unlikely to change 😔

  • https://github.com/pola-rs/polars/issues/19700
  • https://github.com/pola-rs/polars/issues/17527#issuecomment-2497297377

So I'd say I'm +0.5 on adding min_count. Happy to take that route if everyone else is 🙌

dangotbanned avatar Jun 04 '25 17:06 dangotbanned

thanks all for comments!

yup, for pyarrow we already use min_count=0, presumably to align behaviour

I don't think we need a min_count parameter; it should already be controllable with when/then/otherwise and count? There shouldn't be any need to materialise (which is something we never do)
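
In Polars terms, a minimal sketch of that when/then/otherwise + count approach (lazy-friendly, no materialisation; camriddell's snippet below shows the current limitation of nw.when with aggregating expressions):

import polars as pl

lf = pl.LazyFrame({"a": []}, schema={"a": pl.Int64})
result = lf.select(
    pl.when(pl.col("a").count() > 0)   # count of an empty column is 0
    .then(pl.col("a").sum())
    .otherwise(None)
    .alias("a")
).collect()
print(result)  # shape: (1, 1), single null value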

MarcoGorelli avatar Jun 05 '25 09:06 MarcoGorelli

@MarcoGorelli in this case I'm +1 for following Polars as we currently do. Returning 0 is a behavior that has an even stronger historical precedent than the SQL implementation.

Since this will be a known disparity between the set of results created by the DataFrame vs SQL interfaces we should document this and provide guidance on how to force the result to be one or the other.

e.g. If you are expecting users to make use of either DuckDB or Polars and want a consistent summation value for empty inputs across backends:

(Also @MarcoGorelli I think I uncovered a fun logic issue with narwhals.When that diverges from how Polars works [see the bottom-most example in the snippet here]).

import narwhals as nw
import polars as pl
import duckdb

def agnostic_sum(df):
    return (
        nw.from_native(df)
        .select(
            nw.col('a').cast(nw.Int32).sum()
        )
    )

pl_df = pl.DataFrame({'a': []})
duck_rel = duckdb.sql('select * from pl_df')

print(
    ' Different results from empty summation agnostic code! '.center(80, '-'),
    agnostic_sum(pl_df),
    agnostic_sum(duck_rel),
    sep='\n',
)
# ------------ Different results from empty summation agnostic code! -------------
# ┌──────────────────┐
# |Narwhals DataFrame|
# |------------------|
# |  shape: (1, 1)   |
# |  ┌─────┐         |
# |  │ a   │         |
# |  │ --- │         |
# |  │ i32 │         |
# |  ╞═════╡         |
# |  │ 0   │         |
# |  └─────┘         |
# └──────────────────┘
# ┌──────────────────┐
# |Narwhals LazyFrame|
# |------------------|
# |    ┌────────┐    |
# |    │   a    │    |
# |    │ int128 │    |
# |    ├────────┤    |
# |    │   NULL │    |
# |    └────────┘    |
# └──────────────────┘



def agnostic_sum_zero(df):
    return (
        nw.from_native(df)
        .select(
            nw.col('a').cast(nw.Int32).sum().fill_null(0)
        )
    )


print(
    ' Consistent Zero from empty summation agnostic code! '.center(80, '-'),
    agnostic_sum_zero(pl_df),
    agnostic_sum_zero(duck_rel),
    sep='\n',
)
# ------------- Consistent Zero from empty summation agnostic code! --------------
# ┌──────────────────┐
# |Narwhals DataFrame|
# |------------------|
# |  shape: (1, 1)   |
# |  ┌─────┐         |
# |  │ a   │         |
# |  │ --- │         |
# |  │ i32 │         |
# |  ╞═════╡         |
# |  │ 0   │         |
# |  └─────┘         |
# └──────────────────┘
# ┌──────────────────┐
# |Narwhals LazyFrame|
# |------------------|
# |    ┌────────┐    |
# |    │   a    │    |
# |    │ int128 │    |
# |    ├────────┤    |
# |    │      0 │    |
# |    └────────┘    |
# └──────────────────┘


####

def agnostic_sum_null(df):
    # this Polars code works
    # return df.select(
    #     pl.when(pl.col('a').cast(pl.Int32()).count() == 0).then(100)
    # )
    # this Narwhals code does not work: "Expressions which aggregate or change length cannot be passed to 'when'"
    # return df.select(
    #     nw.when(nw.col('a').cast(nw.Int32()).count() == 0).then(100)
    # )

    # workaround requires forced evaluation.
    return nw.from_native(df).select(
        a_sum=nw.col('a').cast(nw.Int32()).sum(),
        a_count=nw.col('a').cast(nw.Int32()).count(),
    ).select(
        a=nw.when(nw.col('a_count') > 0).then(nw.col('a_sum')).otherwise(None)
    )


print(
    ' Consistent Null from empty summation agnostic code! '.center(80, '-'),
    agnostic_sum_null(pl_df),
    agnostic_sum_null(duck_rel),
    sep='\n',
)
# ------------- Consistent Null from empty summation agnostic code! --------------
# ┌──────────────────┐
# |Narwhals DataFrame|
# |------------------|
# |  shape: (1, 1)   |
# |  ┌──────┐        |
# |  │ a    │        |
# |  │ ---  │        |
# |  │ i32  │        |
# |  ╞══════╡        |
# |  │ null │        |
# |  └──────┘        |
# └──────────────────┘
# ┌──────────────────┐
# |Narwhals LazyFrame|
# |------------------|
# |    ┌────────┐    |
# |    │   a    │    |
# |    │ int128 │    |
# |    ├────────┤    |
# |    │   NULL │    |
# |    └────────┘    |
# └──────────────────┘

camriddell avatar Jun 05 '25 15:06 camriddell

Thanks all for comments

I think you've all made a fairly convincing argument to not change the pandas/polars default. (i still think null would be a more useful result, but not by enough of a margin that it's worth deviating from those libraries)

MarcoGorelli avatar Jun 06 '25 11:06 MarcoGorelli

duckdb.duckdb.ParserException: Parser Error: syntax error at or near "over"

oooh, https://github.com/narwhals-dev/narwhals/pull/2649 would resolve this!

EDIT: well, with some further work, but I think it opens the door to it


Gonna try to unify on 0 then. People (like me) who prefer null should still be able to standardise on that with when/then/otherwise and count anyway (but we may need to address #2645)

MarcoGorelli avatar Jun 06 '25 13:06 MarcoGorelli