narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

feat: add methods to Dask backend

Open MarcoGorelli opened this issue 1 year ago • 27 comments

Since #272 we have really minimal support for Dask DataFrame

Expr methods should go in narwhals/_dask/expr.py. For tests, you should use existing tests, and just remove the

    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

part. If you can remove that, and the test passes, it means you've done it correctly

Not too hard

Examples of Expr methods which we should add (see here for the full list):

  • Expr.__sub__
  • Expr.__mul__
  • Expr.shift
  • Expr.cum_sum
  • Expr.is_between

Note: we should not add anything which modifies the index. So, the following should not be added, even though they appear on the list in the link above:

  • Expr.sort
  • Expr.head
  • Expr.tail

Harder

  • DataFrame.group_by
  • DataFrame.filter
  • get things started with namespaces (e.g. Expr.dt, Expr.str, ...)

General guidelines:

  • please don't ask for the issue to be assigned to you
  • please don't ask for permission to work on this issue
  • if you're a first time contribute, please choose 1 method at a time to implement, and leave a comment noting which method you're working on (if you've contributed to Narwhals before, feel free to choose multiple)
  • have fun 🥳

  • Example pull request: https://github.com/narwhals-dev/narwhals/pull/731/files
  • contributing guide: https://github.com/narwhals-dev/narwhals/blob/main/CONTRIBUTING.md

An easy way to check if something still needs doing is to look for tests with

    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

Then:

  1. try removing those two lines
  2. run the test, check that it fails
  3. implement this functionality for Dask
  4. check the test passes

MarcoGorelli avatar Jul 27 '24 13:07 MarcoGorelli

I'll take Expr.__sub__

anopsy avatar Jul 27 '24 16:07 anopsy

I'll take Expr.is_between

aidoskanapyanov avatar Jul 27 '24 16:07 aidoskanapyanov

Taking Expr.__mul__

anopsy avatar Jul 27 '24 16:07 anopsy

Take Expr.sum

DeaMariaLeon avatar Jul 28 '24 16:07 DeaMariaLeon

@MarcoGorelli I have a couple of questions:

  • Could you expand a little bit more, here or somewhere else on the following:

    we should not add anything which modifies the index

  • Should we refer to this issue also for dataframe methods? i.e. not Expr only?

FBruzzesi avatar Jul 28 '24 20:07 FBruzzesi

hey!

in pandas, some Series methods such as Series.sort_values change the index:

In [3]: s
Out[3]:
0    3
1    2
2    1
dtype: int64

In [4]: s.sort_values()
Out[4]:
2    1
1    2
0    3
dtype: int64

pandas / Dask would then auto-align on the index, which is what we want to avoid. in pandas we can just check the index values, as it's already eager, but in dask we don't have a way to do this (though I have opened an issue about this https://github.com/dask/dask-expr/issues/1112)

yup, dataframe methods too 😎

MarcoGorelli avatar Jul 28 '24 20:07 MarcoGorelli

I'm assuming Expr.min/Expr.max have also to be done, so I'll take those

anopsy avatar Jul 29 '24 08:07 anopsy

(Asking for a friend 👀) how much cheating are we allowed to? Specifically, pandas-like translate_dtype should apply one-to-one for dask.

FBruzzesi avatar Jul 30 '24 21:07 FBruzzesi

😄 should be fine to reuse that one

MarcoGorelli avatar Jul 30 '24 21:07 MarcoGorelli

I'll look into len and round

mistShard avatar Jul 31 '24 12:07 mistShard

I'll try drop_nulls and count(not sure about count since it returns a Scalar)

anopsy avatar Aug 02 '24 05:08 anopsy

thanks! we have others that return scalars here, such as sum / min / ...

MarcoGorelli avatar Aug 02 '24 08:08 MarcoGorelli

I will add Expr.abs

mistShard avatar Aug 02 '24 11:08 mistShard

will take Expr.all

mistShard avatar Aug 03 '24 14:08 mistShard

I'll take Expr.std and Expr.null_count

anopsy avatar Aug 07 '24 13:08 anopsy

Ill take is_unique

mistShard avatar Aug 07 '24 14:08 mistShard

will take Expr.all

will also do Expr.any because it is blocking any_all_test.py from passing

mistShard avatar Aug 08 '24 12:08 mistShard

Just to give a sense of how far we have gone and what's still missing.

DaskExpr methods (those marked with * at the end means that modify the index and require evaluation of #743 before implementing):

  • [x] cast
  • [x] quantile
  • [x] over
  • [x] diff
  • [x] is_duplicated
  • [x] is_in
  • [x] is_unique
  • [x] null_count
  • [x] arg_true (*)
  • [ ] cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
  • [ ] drop_null (*) - Currently implemented as NotImplementedError
  • [ ] filter (*)
  • [ ] gather_every (*) - Currently implemented as NotImplementedError
  • [ ] head (*) - Currently implemented as NotImplementedError
  • [ ] sample (*)
  • [ ] sort (*) - Currently implemented as NotImplementedError
  • [ ] tail (*) - Currently implemented as NotImplementedError
  • [ ] unique (*)

ExprDateTimeNamespace methods:

  • [x] to_string
  • [x] total_microseconds
  • [x] total_milliseconds
  • [x] total_minutes
  • [x] total_nanoseconds
  • [x] total_seconds

ExprCatNamespace is missing entirely.

FBruzzesi avatar Aug 11 '24 15:08 FBruzzesi

I'm working on null_count and quantile.

anopsy avatar Aug 12 '24 06:08 anopsy

will work on dt.to_string

lucianosrp avatar Aug 12 '24 11:08 lucianosrp

will work on total_microseconds

luke396 avatar Aug 15 '24 04:08 luke396

I will take diff

raisadz avatar Aug 15 '24 12:08 raisadz

I'll pick up quantile! Edit: No I won't, sorry @anopsy!

Will look at is_duplicated instead 😅

Double edit: I think is_duplicated might be a candidate for "not_implemented", looking at these github issues, it was initially deemed out of scope (since it a tricky thing to check in parallel) and now looks like it could be tabled again soon.

GH Issues:

  • https://github.com/dask/dask/issues/1854
  • https://github.com/dask/dask/issues/10374
  • https://github.com/dask/dask/pull/10542

Theoretically it'd be possible to write an implementation of some kind, but that's probably outside of the scope of Narwhals.

I think the same goes for is_unique as well.

Triple edit: Have taken is_in instead, PR incoming!

benrutter avatar Aug 16 '24 08:08 benrutter

I'll take cast

aidoskanapyanov avatar Aug 20 '24 12:08 aidoskanapyanov

As there is not much left: for anyone interested, we could use the DaskNamespace concat implementation 😇

FBruzzesi avatar Aug 21 '24 07:08 FBruzzesi

I'll take concat if nobody else has yet!

benrutter avatar Aug 21 '24 09:08 benrutter

Woah, looks like a lot is done! Is any stuff left around this? Looks like the remaining expression implementations are dependent on #743. I know a lot have dataframe equivalent implementations though (i.e. filtering a single expression is risky, but there's already a filter method that applies to the whole frame).

Also, happy to volunteer myself to compile a list if it'd be handy! 😁

benrutter avatar Sep 18 '24 11:09 benrutter

Ok, here's the list as good as I can figure for 'done' vs 'outstanding'. I've copied @FBruzzesi's one as well just so that they're all in one place! (stuck with the * for things waiting on #743 decision)

Seems like almost everything (aside from index-tricky stuff) is done! 🎉

DaskExpr

  • [x] cast
  • [x] quantile
  • [x] over
  • [x] diff
  • [x] is_duplicated
  • [x] is_in
  • [x] is_unique
  • [x] null_count
  • [x] arg_true (*)
  • [ ] cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
  • [ ] drop_null (*) - Currently implemented as NotImplementedError
  • [ ] filter () gather_every () - Currently implemented as NotImplementedError
  • [ ] head (*) - Currently implemented as NotImplementedError
  • [ ] sample (*)
  • [ ] sort (*) - Currently implemented as NotImplementedError
  • [ ] tail (*) - Currently implemented as NotImplementedError
  • [ ] unique (*)

ExprDateTimeNamespace

  • [x] to_string
  • [x] total_microseconds
  • [x] total_milliseconds
  • [x] total_minutes
  • [x] total_nanoseconds
  • [x] total_seconds

ExprCatNamespace

  • [ ] get_categories*

Dataframe

  • [x] columns
  • [x] schema
  • [x] collect_schema
  • [x] collect
  • [x] group_by
  • [x] with_columns
  • [ ] with_row_index*
  • [x] drop
  • [x] drop_nulls
  • [x] unique
  • [x] sort
  • [x] unpivot
  • [x] join
  • [x] join_asof
  • [x] filter
  • [x] select

DaskNamespace

  • [x] nth
  • [x] len
  • [x] concat
  • [x] concat_str
  • [x] lit
  • [x] mean
  • [x] mean_horizontal
  • [ ] cat*

Edit: Just looked at this again, and realised that:

  • drop is actually implemented and being tested
  • categories are eager only looking at tests, which makes sense, so probably requiring more discussion to implement.

So I think everything that doesn't require more discussion/thought is implemented?

PeanutsLucyGIF

benrutter avatar Oct 22 '24 15:10 benrutter

Closing this for now as the only methods which are left modify the index and #743 has just been closed

FBruzzesi avatar Dec 31 '24 12:12 FBruzzesi