decision-forests Auto-train, auto-tune & auto-serve the best TF-DF model directly from CSV files

Open rishiraj opened this issue 3 years ago • 1 comments

As my GSoC contributions are almost over, as part of my additional work I'm developing a layer above TF-DF that auto-trains, auto-tunes, and auto-serves the best TF-DF model directly from CSV files.

This should help make TF-DF a favorite choice for tabular data in the Kaggle community, where most training data is in CSV format.

Aug 23 '22 15:08 rishiraj

That sounds awesome @rishiraj !! This tutorial may be a starting point.

Some optional suggestions/ideas:

- For small datasets, add support for n-fold cross-validation. This makes the best use of the available data, and small datasets train so fast that the extra folds cost little.
- For large datasets, two suggestions we often make here:
  - Do most of the hyperparameter tuning on a smaller (sub-sampled) dataset for speed. The result may not be optimal, but resource constraints often make this the more feasible option. Once the best parameters are found, train on the whole dataset.
  - Parallelize the hyperparameter tuning across several machines. This requires know-how of whatever cloud solution one uses to parallelize. The complication here is not ML, but starting the jobs and collecting the results.
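
A minimal sketch of the first two suggestions (n-fold cross-validation, and tuning on a sub-sample before a final full-data training run), in case it helps make them concrete. It uses only the Python standard library. The `train_and_score` hook, the function names, and the `max_depth` parameter in the usage note are all hypothetical, not part of TF-DF; in practice the hook would presumably wrap fitting and evaluating a TF-DF model such as `tfdf.keras.GradientBoostedTreesModel`.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
Params = Dict[str, int]
# Hypothetical hook: train on `train` with `params`, return a score on `valid`
# (higher is better). Not a TF-DF API; the caller supplies it.
TrainAndScore = Callable[[Sequence[Row], Sequence[Row], Params], float]


def k_fold_indices(n_rows: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_idx, valid_idx) pairs; each row is validated exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]


def cross_val_score(rows: Sequence[Row], params: Params,
                    train_and_score: TrainAndScore, k: int = 5, seed: int = 0) -> float:
    """Average the validation score of `params` over k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(rows), k, seed):
        train = [rows[i] for i in train_idx]
        valid = [rows[i] for i in valid_idx]
        scores.append(train_and_score(train, valid, params))
    return mean(scores)


def tune_on_subsample(rows: Sequence[Row], candidates: Sequence[Params],
                      train_and_score: TrainAndScore, sample_size: int,
                      k: int = 5, seed: int = 0) -> Params:
    """Pick the best candidate using cross-validation on a random sub-sample only."""
    rng = random.Random(seed)
    sample = rng.sample(list(rows), min(sample_size, len(rows)))
    return max(candidates,
               key=lambda p: cross_val_score(sample, p, train_and_score, k, seed))
```

For a small dataset one would call `cross_val_score` on all rows directly. For a large one, something like `best = tune_on_subsample(rows, [{"max_depth": d} for d in (3, 6, 9)], train_and_score, sample_size=10_000)` followed by a single training run on the whole dataset with `best`. Parallelizing the candidate evaluations across machines is left out, since that part depends on the cloud setup.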

cheers!

Aug 24 '22 09:08 janpfeifer