Tom Augspurger comments

Results: 851 comments by Tom Augspurger

large-ish Dask arrays randomly breaking on k8s deployment

Perhaps we leave this open to discuss if the default resources for the scheduler should be higher? I don't know what's best here. It'd be nice to have better logs...

Annotations

Thanks. I think we should start with getting this running on CI and gradually add types one module or so at a time. That'll make reviewing things much easier.

It's possible to only use mypy on specific files. Pandas is going through that now. You can see the configuration starting at https://github.com/pandas-dev/pandas/blob/dbd7a5d3e2f1d196e8634c620fc72db1127de157/setup.cfg#L124. I think the current effort is around...

Merge Dask-GLM into Dask-ML

This fell on the back-burner since it's mostly just a development workflow thing. It shouldn't have any user-facing changes. The high-level split will still be * dask-glm for optimizers like...

Merge Dask-GLM into Dask-ML

FYI @mmccarty fixed the merge conflicts. Will see if CI passes.

Categorizer does not preserve order of categories for Pandas != 1.2

> Another question is if one should ever rely on the order of categories in Pandas categorical types...

Only if the categorical is ordered. What does the proposed fixed behavior...
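To make the "only if the categorical is ordered" point concrete, here is a minimal sketch in pandas. The values and category names are made up for illustration and are not taken from the issue; the sketch only shows that category order carries meaning (for comparisons) when `ordered=True`.

```python
import pandas as pd

# Hypothetical data; the category order "c" < "a" < "b" is chosen on purpose
# so that it differs from alphabetical order.
values = ["b", "a", "c"]
categories = ["c", "a", "b"]

unordered = pd.Categorical(values, categories=categories)
ordered = pd.Categorical(values, categories=categories, ordered=True)

# With ordered=True, comparisons follow the declared category order.
ordered_lt_b = ordered < "b"

# Without ordered=True, inequality comparisons are rejected with a TypeError,
# so nothing should depend on where a category sits in the list.
try:
    unordered < "b"
    unordered_comparable = True
except TypeError:
    unordered_comparable = False
```

In both cases `.categories` keeps the order it was given; the difference is whether that order is treated as meaningful.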

ColumnTransformer: 'DataFrame' object has no attribute 'take' with sklearn >= 1.0.0

Thanks for the report @zexuan-zhou. Are you able to debug it further? Most likely scikit-learn previously cast a (dask) DataFrame to an ndarray, but no longer does that. We were...

ColumnTransformer: 'DataFrame' object has no attribute 'take' with sklearn >= 1.0.0

Thanks for the reproducible example. We'll need someone to step through and figure out exactly what changed in scikit-learn / pandas and adapt. I won't have time to work on...

Use an already trained Keras model to predict on lots of data

FYI, I started on this at https://gist.github.com/TomAugspurger/2889a052b5fec4d691f83ba2062d2d92. As you predicted, `X.map_blocks(model.predict)` was slow. I stopped as soon as I hit an error, and didn't do any profiling yet. I'll pick...
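For readers unfamiliar with the pattern, the following is a rough sketch of what `X.map_blocks(model.predict)` does. It is not the code from the gist: `StandInModel` is a hypothetical placeholder for a trained Keras model, and the array shape and chunk sizes are arbitrary assumptions.

```python
import numpy as np
import dask.array as da


class StandInModel:
    """Hypothetical placeholder for an already trained model."""

    def predict(self, block: np.ndarray) -> np.ndarray:
        # One prediction per row; a real model would run inference here.
        return block.sum(axis=1, keepdims=True)


model = StandInModel()

# 1000 samples with 4 features, split into 4 row-wise chunks.
X = da.ones((1000, 4), chunks=(250, 4))

# predict() is called once per chunk. The output has one column per row
# block, so the new chunk shape is passed explicitly.
pred = X.map_blocks(model.predict, dtype=X.dtype, chunks=(250, 1))

result = pred.compute()
```

With a real model, the `model` object has to be serialized and shipped to the workers along with the tasks, which may be one place to look when this pattern is slow; that is a guess, not something the comment above confirms.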

Use an already trained Keras model to predict on lots of data

Oh, and `/profile-server` is going to be extremely useful here. On a whim, I tried `X.map_blocks(delayed(model.predict))` and the scheduler has been at 100% CPU for a minute while the workers...