feat: add methods to Dask backend
Since #272 we have really minimal support for Dask DataFrame
Expr methods should go in `narwhals/_dask/expr.py`. For tests, use the existing tests and just remove the

```python
if "dask" in str(constructor):
    request.applymarker(pytest.mark.xfail)
```

part. If you can remove that and the test still passes, you've done it correctly.
Not too hard
Examples of Expr methods which we should add (see here for the full list; there's a sketch of the underlying Dask operations after the note below):
- `Expr.__sub__`
- `Expr.__mul__`
- `Expr.shift`
- `Expr.cum_sum`
- `Expr.is_between`
Note: we should not add anything which modifies the index. So, the following should not be added, even though they appear on the list in the link above:
- Expr.sort
- Expr.head
- Expr.tail
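For orientation, here's a minimal sketch of the plain dask.dataframe operations those methods would wrap. This is not the Narwhals internal API, just the underlying Dask calls, and the data is made up:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": [3, 1, 2]}), npartitions=1)
a = ddf["a"]

print((a - 1).compute())          # Expr.__sub__
print((a * 2).compute())          # Expr.__mul__
print(a.shift(1).compute())       # Expr.shift
print(a.cumsum().compute())       # Expr.cum_sum
print(a.between(1, 2).compute())  # Expr.is_between (inclusive bounds)
```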
Harder
- `DataFrame.group_by`
- `DataFrame.filter`
- get things started with namespaces (e.g. `Expr.dt`, `Expr.str`, ...); there's a rough sketch of the underlying Dask calls below
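Again, a rough, non-authoritative sketch of the plain Dask operations these would presumably sit on top of (column names and data are just for illustration):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3], "s": ["foo", "bar", "baz"]}),
    npartitions=1,
)

# DataFrame.group_by: a groupby-aggregate, keeping the group key as a column.
print(ddf.groupby("g").agg({"x": "sum"}).reset_index().compute())

# DataFrame.filter: boolean masking on the whole frame.
print(ddf[ddf["x"] > 1].compute())

# Namespaces like Expr.str / Expr.dt map onto Dask's .str / .dt accessors.
print(ddf["s"].str.upper().compute())
```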
General guidelines:
- please don't ask for the issue to be assigned to you
- please don't ask for permission to work on this issue
- if you're a first-time contributor, please choose 1 method at a time to implement, and leave a comment noting which method you're working on (if you've contributed to Narwhals before, feel free to choose multiple)
- have fun 🥳
- Example pull request: https://github.com/narwhals-dev/narwhals/pull/731/files
- contributing guide: https://github.com/narwhals-dev/narwhals/blob/main/CONTRIBUTING.md
An easy way to check if something still needs doing is to look for tests with

```python
if "dask" in str(constructor):
    request.applymarker(pytest.mark.xfail)
```
Then:
- try removing those two lines
- run the test, check that it fails
- implement this functionality for Dask
- check the test passes
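For reference, here's a rough sketch of how that pattern sits inside an existing test. The test name, data, and assertion below are illustrative, and the `constructor` fixture is assumed to be the one the Narwhals test suite already provides:

```python
import pytest

import narwhals as nw


def test_shift(constructor, request) -> None:
    # These are the two lines to delete once the Dask version is implemented.
    if "dask" in str(constructor):
        request.applymarker(pytest.mark.xfail)

    df = nw.from_native(constructor({"a": [1, 2, 3]}))
    result = df.select(nw.col("a").shift(1))
    # ...then compare `result` against the expected output with the suite's helpers.
```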
I'll take Expr.__sub__
I'll take Expr.is_between
Taking Expr.__mul__
I'll take Expr.sum
@MarcoGorelli I have a couple of questions:
- Could you expand a little bit more, here or somewhere else, on the following:
  > we should not add anything which modifies the index
- Should we also refer to this issue for DataFrame methods, i.e. not Expr only?
hey!
in pandas, some Series methods such as Series.sort_values change the index:
```python
In [3]: s
Out[3]:
0    3
1    2
2    1
dtype: int64

In [4]: s.sort_values()
Out[4]:
2    1
1    2
0    3
dtype: int64
```
pandas / Dask would then auto-align on the index, which is what we want to avoid. in pandas we can just check the index values, as it's already eager, but in dask we don't have a way to do this (though I have opened an issue about this https://github.com/dask/dask-expr/issues/1112)
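For example, with plain pandas (just to illustrate the alignment we want to avoid; the data here is made up):

```python
import pandas as pd

s = pd.Series([3, 2, 1])

# sort_values reorders the index to [2, 1, 0]; the addition then matches
# values by index label rather than by position, so each value is added to
# itself (3+3, 2+2, 1+1) instead of to the value in the same row position.
print(s.sort_values() + s)
```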
yup, dataframe methods too 😎
I'm assuming Expr.min/Expr.max also have to be done, so I'll take those
(Asking for a friend 👀) how much cheating are we allowed? Specifically, the pandas-like translate_dtype should apply one-to-one to Dask.
😄 should be fine to reuse that one
I'll look into len and round
I'll try drop_nulls and count (not sure about count since it returns a Scalar)
thanks! we have others that return scalars here, such as sum / min / ...
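For anyone picking these up, a small illustration of how such reductions behave in plain dask.dataframe (sketch only; the exact lazy-scalar class varies across Dask versions):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": [1.0, 2.0, None]}), npartitions=1)

# Reductions like sum/min/count return a lazy Dask scalar, not a plain number;
# .compute() materialises the value.
total = ddf["a"].sum()
print(total.compute())  # 3.0
```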
I will add Expr.abs
will take Expr.all
I'll take Expr.std and Expr.null_count
I'll take is_unique
will take Expr.all
will also do Expr.any because it is blocking any_all_test.py from passing
Just to give a sense of how far we've come and what's still missing:
DaskExpr methods (those marked with * at the end modify the index and require evaluation of #743 before implementing):
- [x] cast
- [x] quantile
- [x] over
- [x] diff
- [x] is_duplicated
- [x] is_in
- [x] is_unique
- [x] null_count
- [x] arg_true (*)
- [ ] cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
- [ ] drop_null (*) - Currently implemented as NotImplementedError
- [ ] filter (*)
- [ ] gather_every (*) - Currently implemented as NotImplementedError
- [ ] head (*) - Currently implemented as NotImplementedError
- [ ] sample (*)
- [ ] sort (*) - Currently implemented as NotImplementedError
- [ ] tail (*) - Currently implemented as NotImplementedError
- [ ] unique (*)
ExprDateTimeNamespace methods:
- [x] to_string
- [x] total_microseconds
- [x] total_milliseconds
- [x] total_minutes
- [x] total_nanoseconds
- [x] total_seconds
ExprCatNamespace is missing entirely.
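For context, these ExprDateTimeNamespace methods sit on top of Dask's .dt accessor. Here's a small illustrative sketch; the strftime format string and the microseconds scaling are just assumptions for the example, not the exact Narwhals implementation:

```python
import pandas as pd
import dask.dataframe as dd

dt = dd.from_pandas(pd.Series(pd.to_datetime(["2024-01-01", "2024-06-15"])), npartitions=1)
td = dd.from_pandas(pd.Series(pd.to_timedelta(["1s", "1500ms", "2min"])), npartitions=1)

print(dt.dt.strftime("%Y-%m-%d").compute())           # dt.to_string-style formatting
print(td.dt.total_seconds().compute())                # total_seconds
print((td.dt.total_seconds() * 1_000_000).compute())  # one way to derive total_microseconds
```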
I'm working on null_count and quantile.
will work on dt.to_string
will work on total_microseconds
I will take diff
I'll pick up quantile! Edit: No I won't, sorry @anopsy!
Will look at is_duplicated instead 😅
Double edit: I think is_duplicated might be a candidate for "not_implemented". Looking at these GitHub issues, it was initially deemed out of scope (since it's a tricky thing to check in parallel) and now looks like it could be tabled again soon.
GH Issues:
- https://github.com/dask/dask/issues/1854
- https://github.com/dask/dask/issues/10374
- https://github.com/dask/dask/pull/10542
Theoretically it'd be possible to write an implementation of some kind, but that's probably outside of the scope of Narwhals.
I think the same goes for is_unique as well.
Triple edit: Have taken is_in instead, PR incoming!
I'll take cast
As there is not much left: for anyone interested, DaskNamespace could still use a concat implementation 😇
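In case it helps whoever picks it up, the plain Dask call a namespace-level concat would presumably build on (just a sketch, not the actual Narwhals code):

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({"a": [1, 2]}), npartitions=1)
right = dd.from_pandas(pd.DataFrame({"a": [3, 4]}), npartitions=1)

# Vertical concatenation of two Dask DataFrames with matching schemas.
print(dd.concat([left, right]).compute())
```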
I'll take concat if nobody else has yet!
Woah, looks like a lot is done! Is there anything left here? It looks like the remaining expression implementations depend on #743. I know a lot of them have DataFrame-level equivalents though (e.g. filtering via a single expression is risky, but there's already a filter method that applies to the whole frame).
Also, happy to volunteer myself to compile a list if it'd be handy! 😁
Ok, here's the list as best I can figure for 'done' vs 'outstanding'. I've copied @FBruzzesi's one as well, just so that they're all in one place! (I've stuck with the * for things waiting on the #743 decision.)
Seems like almost everything (aside from index-tricky stuff) is done! 🎉
DaskExpr
- [x] cast
- [x] quantile
- [x] over
- [x] diff
- [x] is_duplicated
- [x] is_in
- [x] is_unique
- [x] null_count
- [x] arg_true (*)
- [ ] cat (* as .cat.get_categories() changes the size/index, and depends on ExprCatNamespace)
- [ ] drop_null (*) - Currently implemented as NotImplementedError
- [ ] filter (*)
- [ ] gather_every (*) - Currently implemented as NotImplementedError
- [ ] head (*) - Currently implemented as NotImplementedError
- [ ] sample (*)
- [ ] sort (*) - Currently implemented as NotImplementedError
- [ ] tail (*) - Currently implemented as NotImplementedError
- [ ] unique (*)
ExprDateTimeNamespace
- [x] to_string
- [x] total_microseconds
- [x] total_milliseconds
- [x] total_minutes
- [x] total_nanoseconds
- [x] total_seconds
ExprCatNamespace
- [ ] get_categories*
Dataframe
- [x] columns
- [x] schema
- [x] collect_schema
- [x] collect
- [x] group_by
- [x] with_columns
- [ ] with_row_index*
- [x] drop
- [x] drop_nulls
- [x] unique
- [x] sort
- [x] unpivot
- [x] join
- [x] join_asof
- [x] filter
- [x] select
DaskNamespace
- [x] nth
- [x] len
- [x] concat
- [x] concat_str
- [x] lit
- [x] mean
- [x] mean_horizontal
- [ ] cat*
Edit: Just looked at this again, and realised that:
- drop is actually implemented and being tested
- categories are eager-only, judging by the tests, which makes sense, so probably require more discussion to implement
So I think everything that doesn't require more discussion/thought is implemented?
Closing this for now, as the only methods which are left modify the index, and #743 has just been closed.