narwhals feat: add methods to Dask backend

Since #272 we have really minimal support for Dask DataFrame

Expr methods should go in narwhals/_dask/expr.py. For tests, you should use existing tests, and just remove the

    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

part. If you can remove that, and the test passes, it means you've done it correctly

Not too hard

Examples of Expr methods which we should add (see here for the full list):

Expr.__sub__
Expr.__mul__
Expr.shift
Expr.cum_sum
Expr.is_between

Note: we should not add anything which modifies the index. So, the following should not be added, even though they appear on the list in the link above:

Expr.sort
Expr.head
Expr.tail

Harder

DataFrame.group_by
DataFrame.filter
get things started with namespaces (e.g. Expr.dt, Expr.str, ...)

General guidelines:

please don't ask for the issue to be assigned to you
please don't ask for permission to work on this issue
if you're a first time contribute, please choose 1 method at a time to implement, and leave a comment noting which method you're working on (if you've contributed to Narwhals before, feel free to choose multiple)
have fun 🥳

Example pull request: https://github.com/narwhals-dev/narwhals/pull/731/files
contributing guide: https://github.com/narwhals-dev/narwhals/blob/main/CONTRIBUTING.md

An easy way to check if something still needs doing is to look for tests with

    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

Then:

try removing those two lines
run the test, check that it fails
implement this functionality for Dask
check the test passes

Jul 27 '24 13:07 MarcoGorelli

I'll take Expr.__sub__

Jul 27 '24 16:07 anopsy

I'll take Expr.is_between

Jul 27 '24 16:07 aidoskanapyanov

Taking Expr.__mul__

Jul 27 '24 16:07 anopsy

Take Expr.sum

Jul 28 '24 16:07 DeaMariaLeon

@MarcoGorelli I have a couple of questions:

Could you expand a little bit more, here or somewhere else on the following:

we should not add anything which modifies the index
Should we refer to this issue also for dataframe methods? i.e. not Expr only?

Jul 28 '24 20:07 FBruzzesi

hey!

in pandas, some Series methods such as Series.sort_values change the index:

In [3]: s
Out[3]:
0    3
1    2
2    1
dtype: int64

In [4]: s.sort_values()
Out[4]:
2    1
1    2
0    3
dtype: int64

pandas / Dask would then auto-align on the index, which is what we want to avoid. in pandas we can just check the index values, as it's already eager, but in dask we don't have a way to do this (though I have opened an issue about this https://github.com/dask/dask-expr/issues/1112)

yup, dataframe methods too 😎

Jul 28 '24 20:07 MarcoGorelli

I'm assuming Expr.min/Expr.max have also to be done, so I'll take those

Jul 29 '24 08:07 anopsy

(Asking for a friend 👀) how much cheating are we allowed to? Specifically, pandas-like translate_dtype should apply one-to-one for dask.

Jul 30 '24 21:07 FBruzzesi

😄 should be fine to reuse that one

Jul 30 '24 21:07 MarcoGorelli

I'll look into len and round

Jul 31 '24 12:07 mistShard

I'll try drop_nulls and count(not sure about count since it returns a Scalar)

Aug 02 '24 05:08 anopsy

thanks! we have others that return scalars here, such as sum / min / ...

Aug 02 '24 08:08 MarcoGorelli

I will add Expr.abs

Aug 02 '24 11:08 mistShard

will take Expr.all

Aug 03 '24 14:08 mistShard

I'll take Expr.std and Expr.null_count

Aug 07 '24 13:08 anopsy

Ill take is_unique

Aug 07 '24 14:08 mistShard

will take Expr.all

will also do Expr.any because it is blocking any_all_test.py from passing

Aug 08 '24 12:08 mistShard

Just to give a sense of how far we have gone and what's still missing.

DaskExpr methods (those marked with * at the end means that modify the index and require evaluation of #743 before implementing):

[x] cast
[x] quantile
[x] over
[x] diff
[x] is_duplicated
[x] is_in
[x] is_unique
[x] null_count
[x] arg_true (*)
[ ] cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
[ ] drop_null (*) - Currently implemented as NotImplementedError
[ ] filter (*)
[ ] gather_every (*) - Currently implemented as NotImplementedError
[ ] head (*) - Currently implemented as NotImplementedError
[ ] sample (*)
[ ] sort (*) - Currently implemented as NotImplementedError
[ ] tail (*) - Currently implemented as NotImplementedError
[ ] unique (*)

ExprDateTimeNamespace methods:

[x] to_string
[x] total_microseconds
[x] total_milliseconds
[x] total_minutes
[x] total_nanoseconds
[x] total_seconds

ExprCatNamespace is missing entirely.

Aug 11 '24 15:08 FBruzzesi

I'm working on null_count and quantile.

Aug 12 '24 06:08 anopsy

will work on dt.to_string

Aug 12 '24 11:08 lucianosrp

will work on total_microseconds

Aug 15 '24 04:08 luke396

I will take diff

Aug 15 '24 12:08 raisadz

I'll pick up quantile! Edit: No I won't, sorry @anopsy!

Will look at is_duplicated instead 😅

Double edit: I think is_duplicated might be a candidate for "not_implemented", looking at these github issues, it was initially deemed out of scope (since it a tricky thing to check in parallel) and now looks like it could be tabled again soon.

GH Issues:

https://github.com/dask/dask/issues/1854
https://github.com/dask/dask/issues/10374
https://github.com/dask/dask/pull/10542

Theoretically it'd be possible to write an implementation of some kind, but that's probably outside of the scope of Narwhals.

I think the same goes for is_unique as well.

Triple edit: Have taken is_in instead, PR incoming!

Aug 16 '24 08:08 benrutter

I'll take cast

Aug 20 '24 12:08 aidoskanapyanov

As there is not much left: for anyone interested, we could use the DaskNamespace concat implementation 😇

Aug 21 '24 07:08 FBruzzesi

I'll take concat if nobody else has yet!

Aug 21 '24 09:08 benrutter

Woah, looks like a lot is done! Is any stuff left around this? Looks like the remaining expression implementations are dependent on #743. I know a lot have dataframe equivalent implementations though (i.e. filtering a single expression is risky, but there's already a filter method that applies to the whole frame).

Also, happy to volunteer myself to compile a list if it'd be handy! 😁

Sep 18 '24 11:09 benrutter

Ok, here's the list as good as I can figure for 'done' vs 'outstanding'. I've copied @FBruzzesi's one as well just so that they're all in one place! (stuck with the * for things waiting on #743 decision)

Seems like almost everything (aside from index-tricky stuff) is done! 🎉

DaskExpr

[x] cast
[x] quantile
[x] over
[x] diff
[x] is_duplicated
[x] is_in
[x] is_unique
[x] null_count
[x] arg_true (*)
[ ] cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
[ ] drop_null (*) - Currently implemented as NotImplementedError
[ ] filter () gather_every () - Currently implemented as NotImplementedError
[ ] head (*) - Currently implemented as NotImplementedError
[ ] sample (*)
[ ] sort (*) - Currently implemented as NotImplementedError
[ ] tail (*) - Currently implemented as NotImplementedError
[ ] unique (*)

ExprDateTimeNamespace

[x] to_string
[x] total_microseconds
[x] total_milliseconds
[x] total_minutes
[x] total_nanoseconds
[x] total_seconds

ExprCatNamespace

[ ] get_categories*

Dataframe

[x] columns
[x] schema
[x] collect_schema
[x] collect
[x] group_by
[x] with_columns
[ ] with_row_index*
[x] drop
[x] drop_nulls
[x] unique
[x] sort
[x] unpivot
[x] join
[x] join_asof
[x] filter
[x] select

DaskNamespace

[x] nth
[x] len
[x] concat
[x] concat_str
[x] lit
[x] mean
[x] mean_horizontal
[ ] cat*

Edit: Just looked at this again, and realised that:

drop is actually implemented and being tested
categories are eager only looking at tests, which makes sense, so probably requiring more discussion to implement.

So I think everything that doesn't require more discussion/thought is implemented?

PeanutsLucyGIF

Oct 22 '24 15:10 benrutter

Closing this for now as the only methods which are left modify the index and #743 has just been closed

Dec 31 '24 12:12 FBruzzesi