MLDataPattern.jl Updated to reflect new LearnBase interface for getobs and ObsDim

This is a WIP in response to https://github.com/JuliaML/LearnBase.jl/pull/44. The changes to the LearnBase.jl interface reduce the MLDataPattern.jl codebase significantly. Most notably, we are able to avoid a lot of the routing logic for getobs.

Since there are other JuliaML packages that will need similar updates, I want to summarize the highlights of what this transition entails:

Most of it is removing convert(LearnBase.ObsDimension, obsdim) statements; obsdim now has no type restrictions.
For a lot of array code, generated functions were used to create views based on the obsdim type. This is no longer necessary, and you can directly call selectdim(A, idx, obsdim) from Base. In some crude testing, I found no performance regressions from doing this. selectdim essentially does the views + colon tricks that the generated functions used to do.
Make sure to remove definitions for getobs(x), getobs(x, obsdim), and nobs(x) from the code. The only interface functions to implement are getobs(x, idx, obsdim) and nobs(x, obsdim) (note: this may change slightly based on the LearnBase.jl PR but deleting the above methods is still valid.
If most of the JuliaML packages are structured similar to MLDataPattern.jl, then the bulk of the changes will be in one or two files.

There are still some parts I haven't addressed like gettarget and BatchView. There are no technical challenges here; I just haven't gotten to them. We'll want to get those changes in too before merging.

Oct 29 '20 16:10 darsnack

You could temporarily check in a Manifest.toml file to track on https://github.com/JuliaML/LearnBase.jl/pull/44 so as to put CI back to the track. This PR should be approved first with green checks in CI before we merge and tag a new release for https://github.com/JuliaML/LearnBase.jl/pull/44

Oct 30 '20 12:10 johnnychen94

So one issue that I've run into with MLDataPattern.jl. There are some cases (e.g. DataSubset) where the obsdim is stored as part of the struct. In these cases, I (speaking from a developer's perspective) want to just define

LearnBase.getobs(data::DataSubset, idx) =
    getobs(subset.data, _view(subset.indices, idx); obsdim = subset.obsdim)

because it does not make sense that obsdim should be anything other than subset.obsdim. But since technically a user can call getobs(subset, idx; obsdim = 2) (i.e. set obsdim kwarg to anything), I need to insert a check:

function LearnBase.getobs(subset::DataSubset, idx; obsdim = default_obsdim(subset))
    @assert obsdim === subset.obsdim
    return getobs(subset.data, _view(subset.indices, idx); obsdim = obsdim)
end

Or I can just ignore obsdim kwarg and return getobs(subset.data, _view(subset.indices, idx); obsdim = subset.obsdim).

Both options seem messy to me. Since the default_obsdim interface exists, I don't think there is any reason to be storing obsdim as part of structs. Instead we can just do something like:

LearnBase.default_obsdim(subset::DataSubset) = default_obsdim(subset.data)

LearnBase.getobs(subset::DataSubset, idx; obsdim = default_obsdim(subset)) =
    getobs(subset.data, _view(subset.indices, idx); obsdim = obsdim)

There are two reasons for this:

In most cases, the user will never bother specifying a different obsdim than default_obsdim(subset.data).
If subset.data is an array with the observation dimension in the 2nd dim for some reason, the user can override the default by setting obsdim = 2. This is the reason why I think that storing the obsdim does not make much sense, since if it not the default, then the user specifies it anyways.

UPDATE:

I see now that perhaps the reason for storing obsdim was to facilitate the bounds check in the inner constructor:

function DataSubset{T,I,O}(data::T, indices::I,obsdim::O) where {T,I,O}
    if T <: Tuple
       error("inner constructor should not be called using a Tuple")
    end
    1 <= minimum(indices) || throw(BoundsError(data, indices))
    maximum(indices) <= nobs(data, obsdim) || throw(BoundsError(data, indices))
    new{T,I,O}(data, indices, obsdim)
end

While this kind of check is a nice convenience, I think it just increases the code complexity of MLDataPattern.jl, and it is not necessary. If indices is out of bounds for data, then a bounds error will be thrown on the actual call to getindex(data, indices).

Nov 03 '20 16:11 darsnack

Sorry for the delay. Not using MLDataPattern for quite a long time, I'm not as familiar with it as you are so I don't know which is better. Generally, we want to make things simpler during this refactoring.

If making it convenient for users makes it complex and thus harder to maintain for developers, I personally prefer the complex version just to make it more intuitive to use.

the reason for storing obsdim was to facilitate the bounds check in the inner constructor:

Given that usage like DataSubset(X', 21:100, ObsDim.First()) is now unavailable, I guess we could remove that if this becomes the only usage case.

Nov 06 '20 06:11 johnnychen94

Hi, any update on this? When will [email protected] be supported? Can I help with this somehow?

Oct 10 '21 21:10 racinmat

Yes, this PR effort is something I haven't had time to push through. I will put it together in a stage this week where someone else can take the torch (with an explanation of what needs to be done).

Oct 11 '21 01:10 darsnack

For the views like FoldsView, BatchView, and SlidingWindow, etc., I think we can just drop the abstract types. There is very little code that dispatches on the abstract types, and we can easily use @eval with a loop to define repeated code for the few concrete types that subtype e.g. SlidingWindow. Does that sound good @racinmat?

Oct 26 '21 00:10 darsnack

Sounds good. Although I consider eval with loop for defining them a bit of abuse and hard to read code and I would rathere have some lightweight supertype which would be part of MLDataPattern and we would define the behavior needed for FoldsView, BatchView, and SlidingWindow etc. there. That sounds more julianic to me.

Oct 26 '21 06:10 racinmat

One of the things I would like to determine is how much of the non-getobs/nobs code can be removed (or at least is there value in keeping it). If there is a need to keep a lot of the shared interfaces, then we can define a super type. Instead of inheriting from DataView, it would inherit from AbstractDataContainer directly.

Oct 26 '21 12:10 darsnack

So it should be ObsIterator -> AbstractDataIterator BatchIterator -> AbstractDataIterator AbstractObsView -> AbstractDataContainer AbstractBatchView -> AbstractDataContainer DataView ->AbstractDataContainer etc.?

Oct 26 '21 13:10 racinmat

Yeah basically if there is an abstract type like SlidingWindow that actually gets used (to define some shared code), then keep it. If there is an abstract type that is just another layer between AbstractDataContainer/AbstractDataIterator with no real use, then remove it and its subtypes can inherit directly from AbstractDataContainer/AbstractDataIterator.

Oct 26 '21 15:10 darsnack

Can someone rebase?

Oct 26 '21 23:10 CarloLucibello

Hi @rancimat, what is the status of this?

Dec 25 '21 11:12 CarloLucibello

@racinmat had to hand back off to me due to work commitments. I have been busy all semester, but I will work on this during break.

Dec 25 '21 13:12 darsnack

Yes, in the end I had less time for this, so I did what I could, but I won't be able to finish this, sorry.

Dec 25 '21 13:12 racinmat

MLDataPattern.jl MLDataPattern.jl copied to clipboard

Updated to reflect new LearnBase interface for getobs and ObsDim

MLDataPattern.jl
MLDataPattern.jl copied to clipboard