EvoTrees.jl
How to train over input that is much larger than RAM?
I wonder if there's a way to iteratively train over chunks of input data (or even row by row), manually. We deal with data much larger than RAM, and it also doesn't fit the table interface -- in short, each "row" can contain many variables, some of which are vectors of variable length, so we need to compute the input to EvoTrees on the fly.
Support for out-of-memory data is something I'd like to see added.
Do you have constraints with regard to the storage format of the data? Off the top of my head, I'd think of working out of a DTable: https://juliaparallel.github.io/Dagger.jl/stable/dtable/ and perhaps integrating with a DataLoader interface if needed. I understand your source data is in another format, yet I can hardly imagine a totally arbitrary data loader, as the boosted-trees algorithm assumes that all variables/features are consistently available for all data points.
Would it be reasonable to perform a preprocessing step on your data to bring it into a more structured form like a DTable?
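For illustration, something along these lines might work, a minimal sketch assuming the derived features can be computed one file at a time into flat Tables.jl-compatible columns (the `compute_features` helper and the file list are hypothetical; the loader-per-file DTable constructor follows the Dagger.jl docs linked above, and the exact API may differ by version):

```julia
using Dagger  # DTable ships with Dagger.jl (more recently split out into DTables.jl)

# Hypothetical helper: read one source file and return a NamedTuple of
# flat, fixed-width feature columns (one value per "row"/event).
function compute_features(path::String)
    # ... open the file, loop over events, build derived variables ...
    # Dummy columns here so the sketch is self-contained.
    return (x1 = rand(Float32, 1_000), x2 = rand(Float32, 1_000), y = rand(Float32, 1_000))
end

paths = ["run1.dat", "run2.dat"]   # hypothetical list of input files

# One partition per file; partitions are materialized on demand through
# Dagger tasks, so the full dataset never has to fit in RAM at once.
dt = DTable(compute_features, paths)
```

The point of the sketch is just that the preprocessing can stay lazy: the transformation runs per partition when the table is consumed, not up front.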
Do you have constraints with regard to the storage format of the data?
Yes, it's CERN ROOT, and we wrote https://github.com/tamasgal/UnROOT.jl from scratch. Physically (on disk), it's a bit like Apache Parquet.
a DataLoader interface
Yeah, I don't think I can just make a DTable, because the variables I'd like to use for the BDT are not available in the file, and it takes non-trivial selection/transformation to build them on the fly (but we still need to build them on the fly; staging intermediate files is just too cumbersome).
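To make the on-the-fly part concrete, here is a minimal sketch of what we do today, assuming UnROOT.jl's `ROOTFile`/`LazyTree` interface and EvoTrees' `fit_evotree(config; x_train, y_train)` keyword form (the file name, tree name, branches, derived features, target, and chunk size are all made up for illustration, and the exact `fit_evotree` signature depends on the EvoTrees version):

```julia
using UnROOT, EvoTrees

f = ROOTFile("events.root")                          # hypothetical file
t = LazyTree(f, "Events", ["Muon_pt", "Muon_eta"])   # hypothetical tree/branches

# Turn one event (jagged per-muon vectors) into a fixed-width feature row;
# the features themselves are made up.
features(evt) = (
    nmuon   = Float32(length(evt.Muon_pt)),
    lead_pt = isempty(evt.Muon_pt)  ? 0.0f0 : Float32(maximum(evt.Muon_pt)),
    max_eta = isempty(evt.Muon_eta) ? 0.0f0 : Float32(maximum(abs, evt.Muon_eta)),
)

# Materialize one RAM-sized chunk of derived features at a time.
chunk = first(Iterators.partition(t, 100_000))
x = Float32[features(evt)[j] for evt in chunk, j in 1:3]
y = Float32[length(evt.Muon_pt) > 1 for evt in chunk]   # dummy target

# Train on the chunk that fits in memory.
config = EvoTreeRegressor(nrounds=100, max_depth=5, eta=0.1)
model  = fit_evotree(config; x_train=x, y_train=y)
```

This only trains on a single chunk, of course; iterating over successive chunks would need some form of warm-start/incremental training in EvoTrees, which is exactly what I'm asking about.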