dask-ml
Dask-xgboost example for dask-examples
It would be nice to see an example using the Dask/XGBoost handoff for parallel training and predicting. This is a common question and so would likely have high value.
It would also be useful for this to be smoothly runnable on dask-examples. Presumably we'll have to use a few processes within a LocalCluster and be careful not to blow out RAM on the small containers (XGBoost can be a bit greedy).
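A minimal sketch of what "a few processes within a LocalCluster" with a memory cap might look like. The helper name `make_small_cluster`, the worker count, and the 1 GiB per-worker limit are illustrative assumptions, not values taken from dask-examples; the right numbers depend on the container size.

```python
from dask.distributed import Client, LocalCluster


def make_small_cluster(n_workers=2, memory_limit="1GiB", processes=True):
    """Start a local cluster whose workers each have a capped memory budget.

    The defaults are illustrative guesses for a small container.
    """
    return LocalCluster(
        n_workers=n_workers,        # a few workers rather than one per core
        threads_per_worker=1,       # one thread per worker
        processes=processes,        # separate processes by default
        memory_limit=memory_limit,  # per-worker limit, not a cluster total
    )
```

In a notebook this would be used as `client = Client(make_small_cluster())`. Note that `memory_limit` applies per worker, so the total budget is roughly `n_workers * memory_limit`, and memory XGBoost allocates outside of Dask's tracking may still need headroom.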
It looks like there is an example in the documentation here: http://dask-ml.readthedocs.io/en/latest/examples/xgboost.html
It's nice in many respects (real data, easily interpretable problem, ...)
However, a couple of things about it are concerning:
- Hard to scale down for users to try things out easily
- The ROC curve at the end is not very exciting. I wonder if there is better pre-processing that could be done if we choose to continue with this dataset
Alternatively there might be some artificial dataset that we can create more easily instead.
> It looks like there is an example in the documentation here: http://dask-ml.readthedocs.io/en/latest/examples/xgboost.html
I certainly think this is a good example to keep, and maybe we could implement a new one in dask-examples. The existing one works well as a static example: it shows an interesting problem that's harder to scale.
I think if we implement a new example for dask-examples, we should use a synthetic dataset. For me the biggest annoyance is the time it takes to process the dataset (at least a minute, often two minutes).
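As a rough illustration of the synthetic-dataset idea, here is one way to generate a chunked binary classification problem with plain `dask.array`. The function name `make_synthetic`, the sizes, and the linear-plus-noise labelling rule are all assumptions made for this sketch; the actual example may well use a different generator.

```python
import dask.array as da
import numpy as np


def make_synthetic(n_samples=100_000, n_features=20, chunk_rows=10_000, seed=0):
    """Build a lazy, chunked (X, y) pair for binary classification."""
    rs = da.random.RandomState(seed)
    # Features: standard normal, chunked along rows only.
    X = rs.normal(size=(n_samples, n_features), chunks=(chunk_rows, n_features))
    # A fixed random weight vector defines the "true" decision boundary.
    w = np.random.RandomState(seed).normal(size=n_features)
    # Label noise keeps the problem from being perfectly separable.
    noise = rs.normal(size=(n_samples,), chunks=(chunk_rows,))
    y = (X.dot(w) + noise > 0).astype("int8")
    return X, y


X, y = make_synthetic()
```

Nothing is computed until the arrays are used, so building the dataset is nearly instant, and `n_samples` can be lowered to fit small containers.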
I've opened a PR at https://github.com/dask/dask-examples/pull/14 that mirrors the dask-ml documentation example, but is quicker to run because it uses synthetic data.
This is closed by https://github.com/dask/dask-examples/pull/14, correct?
Hello everyone, I'm Yash. I have experience in machine learning and web development, and I'm new to open source. I have never contributed before; could anyone give me advice on how to start my first contribution?