EvoTrees.jl

How to train over input that is much larger than RAM?

Open Moelf opened this issue 3 years ago • 2 comments

I wonder if there's a way to iteratively train over chunks of input data (or even row by row), manually. We deal with data much larger than RAM that also doesn't fit the table interface -- in short, each "row" can contain many variables, some of which are vectors of variable length, so we need to compute the input to EvoTrees on the fly.

Moelf avatar Mar 09 '22 06:03 Moelf

Support for out-of-memory data is something I'd like to see added.

Do you have constraints with regard to the storage format of the data? Off the top of my head, I'd think of working out of a DTable: https://juliaparallel.github.io/Dagger.jl/stable/dtable/ and perhaps integrating with a DataLoader interface if needed. I understand your source data is in another format, yet I can hardly imagine a totally arbitrary data loader, as the boosted-trees algorithm assumes that all variables/features are consistently available for all data points.

Would it be reasonable to perform a preprocessing step on your data to bring it into a more structured form like a DTable?
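Something along these lines, as a minimal sketch: it assumes the `DTable(table, chunksize)` constructor and `map`/`fetch` behaviour documented at the Dagger.jl page above, and EvoTrees' keyword-based `fit_evotree(config; x_train, y_train)` API. The column names and feature transformations are made up for illustration, and note that EvoTrees still needs the final feature matrix to fit in memory.

```julia
using Dagger, EvoTrees

# Build a DTable from any Tables.jl-compatible source, partitioned into chunks
# so preprocessing can proceed chunk by chunk rather than all at once.
raw = (x1 = rand(10_000), x2 = rand(10_000), y = rand(10_000))  # stand-in for real data
dt  = DTable(raw, 1_000)                                        # 1_000 rows per chunk

# Row-wise feature engineering expressed on the DTable (hypothetical features).
feat = map(r -> (f1 = r.x1 * r.x2, f2 = r.x1 - r.x2, y = r.y), dt)

# EvoTrees itself expects in-memory arrays, so the engineered features
# must fit in RAM even if the raw data does not.
tbl     = fetch(feat)
x_train = hcat(tbl.f1, tbl.f2)
y_train = tbl.y

config = EvoTreeRegressor(nrounds = 100, max_depth = 5)
model  = fit_evotree(config; x_train, y_train)
```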

jeremiedb avatar Mar 11 '22 01:03 jeremiedb

Do you have constraints with regard to the storage format of the data?

Yes, it's CERN ROOT, and we wrote https://github.com/tamasgal/UnROOT.jl from scratch to read it. Physically (on disk), it's a bit like Apache Parquet.

a DataLoader interface

Yeah, I don't think I can just make a DTable, because the variables I'd like to use for the BDT are not available in the file, and it takes non-trivial selection/transformation to produce them on the fly. (We still need to make them on the fly; staging intermediate files is just too cumbersome.)
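For context, a rough sketch of what "computing the input on the fly" can look like with UnROOT, assuming its `ROOTFile`/`LazyTree` API; the file path, tree name, branch names, and summary features are hypothetical, and the resulting feature matrix is still materialized in memory before fitting.

```julia
using UnROOT, EvoTrees

f = ROOTFile("events.root")                       # hypothetical file
t = LazyTree(f, "Events", ["Muon_pt", "Jet_pt"])  # lazily-read jagged branches

# Reduce each variable-length vector to a few fixed-length summary features.
rows = [(nmu      = length(evt.Muon_pt),
         mu_ptmax = isempty(evt.Muon_pt) ? 0.0f0 : maximum(evt.Muon_pt),
         ht       = sum(evt.Jet_pt))
        for evt in t]

# Assemble a numeric feature matrix (rows × features) for EvoTrees.
x_train = [getfield(r, k) for r in rows, k in (:nmu, :mu_ptmax, :ht)]
y_train = rand(Float32, length(rows))             # placeholder target

config = EvoTreeRegressor(nrounds = 100)
model  = fit_evotree(config; x_train, y_train)
```

This still builds the whole feature matrix in RAM; the open question in this issue is whether the fitting step itself could consume such data chunk by chunk.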

Moelf avatar Mar 11 '22 01:03 Moelf