[Discussion] Support the MLDataPattern API for data containers

Open ablaom opened this issue 3 years ago • 0 comments

I am referring to the getobs interface from MLDataPattern.jl, which is being migrated to MLUtils.jl.

To allow for certain kinds of training on data that does not fit into memory, I should like MLJ to eventually support models, such as Flux models, that can accept data supplied by DataLoaders.jl. However, I feel these models should play nicely with MLJ’s general performance evaluation (aka resampling) apparatus (e.g., cross-validation), as MLJFlux models currently do. This apparatus is also used by MLJ’s IteratedModel wrapper for controlling iterative models (which needs out-of-sample performance estimates for stopping criteria, for example). However, the performance estimation apparatus has been designed principally around in-memory arrays and tabular data, which is what more than 90% of the models we wrap consume.

To add MLJ support for the getobs API, on which DataLoaders is based, it will be helpful if Tables.jl plays nicely with the getobs interface, something I have requested at https://github.com/JuliaML/MLUtils.jl/issues/61 (see also https://github.com/JuliaML/MLUtils.jl/issues/67). Related to this effort are apparent restrictions in the Tables.jl API regarding efficient row indexing (the current API only exposes row iteration), which are being actively investigated here.
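
For concreteness, here is a minimal sketch of what a data container looks like under the getobs interface, and why index-based access is what a resampling strategy needs. The LazySquares type and the train/test split are hypothetical, chosen only for illustration; this is not how MLJ implements resampling, and it assumes MLUtils.jl is installed.

```julia
using MLUtils: MLUtils, numobs, getobs

# Hypothetical container: each observation is computed on demand,
# standing in for data that would be read lazily from disk.
struct LazySquares
    n::Int
end

# The two methods that make a type a data container:
MLUtils.numobs(data::LazySquares) = data.n
MLUtils.getobs(data::LazySquares, i::Integer) = i^2
MLUtils.getobs(data::LazySquares, idx::AbstractVector{<:Integer}) = [i^2 for i in idx]

data = LazySquares(10)

# A holdout split only needs observation indices, never the full data:
train, test = 1:7, 8:10
Xtrain = getobs(data, train)
Xtest = getobs(data, test)
```

Row iteration alone, which is what the Tables.jl API currently exposes, would not allow the getobs(data, idx) calls above without first materializing or scanning the rows.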

Related online/incremental learning issue: #60

ablaom, Apr 01 '22 00:04