[Discussion] Support the MLDataPattern API for data containers
I am referring to the getobs interface, which is being migrated to MLUtils.jl.
To allow training on data that does not fit into memory, I should like MLJ to eventually support models, such as Flux models, that accept data supplied by DataLoaders.jl. However, I feel these models should play nicely with MLJ’s general performance evaluation (aka resampling) apparatus (e.g., cross-validation), as MLJFlux models currently do. This apparatus is also used by MLJ’s IterativeModel wrapper for controlling iterative models (which needs out-of-sample performance estimates for its stopping criteria, for example). The difficulty is that the performance estimation apparatus has been designed principally around in-memory arrays and tabular data, which is what more than 90% of the models we wrap consume.
To add MLJ support for the getobs API, on which DataLoaders is based, it will be helpful if Tables.jl plays nicely with the getobs interface, something I have requested at https://github.com/JuliaML/MLUtils.jl/issues/61 (see also https://github.com/JuliaML/MLUtils.jl/issues/67). Related to this effort is an apparent restriction in the Tables.jl API concerning efficient row-indexing (the current API only exposes row iteration), which is being actively investigated.
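To make the requirement concrete, here is a minimal sketch of what resampling needs from a data container: a way to count observations and a way to select them by index. The functions numobs_sketch and getobs_sketch are local stand-ins written for this illustration, not the MLUtils.jl definitions, and LazySquares and kfold_indices are likewise hypothetical names. The NamedTuple methods show the kind of row-indexing a column table would need to provide; they do not reflect how Tables.jl or MLUtils.jl actually handle tables.

```julia
# Local stand-ins for a two-function observation interface.
# These are NOT the MLUtils.jl functions; they only mimic the shape of such an API.
numobs_sketch(data) = length(data)
getobs_sketch(data, idx) = data[idx]

# Hypothetical container whose observations are computed on demand,
# standing in for data that does not fit into memory.
struct LazySquares
    n::Int
end
Base.length(d::LazySquares) = d.n
Base.getindex(d::LazySquares, i::Integer) = i^2
Base.getindex(d::LazySquares, idx::AbstractVector{<:Integer}) = [d[i] for i in idx]

# A column table, represented here as a NamedTuple of equal-length vectors.
# Row-indexing is applied column by column (an assumption for this sketch).
numobs_sketch(tbl::NamedTuple) = length(first(tbl))
getobs_sketch(tbl::NamedTuple, idx) = map(col -> col[idx], tbl)

# Unshuffled k-fold cross-validation expressed purely as (train, test) index pairs,
# so it works for any container supporting the two functions above.
function kfold_indices(n::Integer, k::Integer)
    folds = [collect(i:k:n) for i in 1:k]
    return [(setdiff(1:n, test), test) for test in folds]
end

# Usage: the same resampling loop serves both the lazy container and the table.
data = LazySquares(10)
table = (x = collect(1:10), y = collect(11:20))
for (train, test) in kfold_indices(numobs_sketch(data), 5)
    lazy_train = getobs_sketch(data, train)    # materializes only the rows requested
    table_test = getobs_sketch(table, test)    # NamedTuple holding the test rows
end
```

The point of the sketch is that once every supported container answers these two questions, the resampling code no longer needs to know whether it is handling an array, a table, or a lazily loaded dataset.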
Related online/incremental learning issue: #60