MLDataPattern.jl
MLDataPattern.jl copied to clipboard
Updated to reflect new LearnBase interface for getobs and ObsDim
This is a WIP in response to https://github.com/JuliaML/LearnBase.jl/pull/44. The changes to the LearnBase.jl interface reduce the MLDataPattern.jl codebase significantly. Most notably, we are able to avoid a lot of the routing logic for getobs
.
Since there are other JuliaML packages that will need similar updates, I want to summarize the highlights of what this transition entails:
- Most of it is removing
convert(LearnBase.ObsDimension, obsdim)
statements;obsdim
now has no type restrictions. - For a lot of array code, generated functions were used to create views based on the
obsdim
type. This is no longer necessary, and you can directly callselectdim(A, idx, obsdim)
from Base. In some crude testing, I found no performance regressions from doing this.selectdim
essentially does the views + colon tricks that the generated functions used to do. - Make sure to remove definitions for
getobs(x)
,getobs(x, obsdim)
, andnobs(x)
from the code. The only interface functions to implement aregetobs(x, idx, obsdim)
andnobs(x, obsdim)
(note: this may change slightly based on the LearnBase.jl PR but deleting the above methods is still valid. - If most of the JuliaML packages are structured similar to MLDataPattern.jl, then the bulk of the changes will be in one or two files.
There are still some parts I haven't addressed like gettarget
and BatchView
. There are no technical challenges here; I just haven't gotten to them. We'll want to get those changes in too before merging.
You could temporarily check in a Manifest.toml
file to track on https://github.com/JuliaML/LearnBase.jl/pull/44 so as to put CI back to the track. This PR should be approved first with green checks in CI before we merge and tag a new release for https://github.com/JuliaML/LearnBase.jl/pull/44
So one issue that I've run into with MLDataPattern.jl. There are some cases (e.g. DataSubset
) where the obsdim
is stored as part of the struct. In these cases, I (speaking from a developer's perspective) want to just define
LearnBase.getobs(data::DataSubset, idx) =
getobs(subset.data, _view(subset.indices, idx); obsdim = subset.obsdim)
because it does not make sense that obsdim
should be anything other than subset.obsdim
. But since technically a user can call getobs(subset, idx; obsdim = 2)
(i.e. set obsdim
kwarg to anything), I need to insert a check:
function LearnBase.getobs(subset::DataSubset, idx; obsdim = default_obsdim(subset))
@assert obsdim === subset.obsdim
return getobs(subset.data, _view(subset.indices, idx); obsdim = obsdim)
end
Or I can just ignore obsdim
kwarg and return getobs(subset.data, _view(subset.indices, idx); obsdim = subset.obsdim)
.
Both options seem messy to me. Since the default_obsdim
interface exists, I don't think there is any reason to be storing obsdim
as part of structs. Instead we can just do something like:
LearnBase.default_obsdim(subset::DataSubset) = default_obsdim(subset.data)
LearnBase.getobs(subset::DataSubset, idx; obsdim = default_obsdim(subset)) =
getobs(subset.data, _view(subset.indices, idx); obsdim = obsdim)
There are two reasons for this:
- In most cases, the user will never bother specifying a different
obsdim
thandefault_obsdim(subset.data)
. - If
subset.data
is an array with the observation dimension in the 2nd dim for some reason, the user can override the default by settingobsdim = 2
. This is the reason why I think that storing theobsdim
does not make much sense, since if it not the default, then the user specifies it anyways.
UPDATE:
I see now that perhaps the reason for storing obsdim
was to facilitate the bounds check in the inner constructor:
function DataSubset{T,I,O}(data::T, indices::I,obsdim::O) where {T,I,O}
if T <: Tuple
error("inner constructor should not be called using a Tuple")
end
1 <= minimum(indices) || throw(BoundsError(data, indices))
maximum(indices) <= nobs(data, obsdim) || throw(BoundsError(data, indices))
new{T,I,O}(data, indices, obsdim)
end
While this kind of check is a nice convenience, I think it just increases the code complexity of MLDataPattern.jl, and it is not necessary. If indices
is out of bounds for data
, then a bounds error will be thrown on the actual call to getindex(data, indices)
.
Sorry for the delay. Not using MLDataPattern for quite a long time, I'm not as familiar with it as you are so I don't know which is better. Generally, we want to make things simpler during this refactoring.
If making it convenient for users makes it complex and thus harder to maintain for developers, I personally prefer the complex version just to make it more intuitive to use.
the reason for storing obsdim was to facilitate the bounds check in the inner constructor:
Given that usage like DataSubset(X', 21:100, ObsDim.First())
is now unavailable, I guess we could remove that if this becomes the only usage case.
Hi, any update on this? When will [email protected] be supported? Can I help with this somehow?
Yes, this PR effort is something I haven't had time to push through. I will put it together in a stage this week where someone else can take the torch (with an explanation of what needs to be done).
For the views like FoldsView
, BatchView
, and SlidingWindow
, etc., I think we can just drop the abstract types. There is very little code that dispatches on the abstract types, and we can easily use @eval
with a loop to define repeated code for the few concrete types that subtype e.g. SlidingWindow
. Does that sound good @racinmat?
Sounds good. Although I consider eval with loop for defining them a bit of abuse and hard to read code and I would rathere have some lightweight supertype which would be part of MLDataPattern and we would define the behavior needed for FoldsView
, BatchView
, and SlidingWindow
etc. there. That sounds more julianic to me.
One of the things I would like to determine is how much of the non-getobs/nobs code can be removed (or at least is there value in keeping it). If there is a need to keep a lot of the shared interfaces, then we can define a super type. Instead of inheriting from DataView
, it would inherit from AbstractDataContainer
directly.
So it should be
ObsIterator
-> AbstractDataIterator
BatchIterator
-> AbstractDataIterator
AbstractObsView
-> AbstractDataContainer
AbstractBatchView
-> AbstractDataContainer
DataView
->AbstractDataContainer
etc.?
Yeah basically if there is an abstract type like SlidingWindow
that actually gets used (to define some shared code), then keep it. If there is an abstract type that is just another layer between AbstractDataContainer
/AbstractDataIterator
with no real use, then remove it and its subtypes can inherit directly from AbstractDataContainer
/AbstractDataIterator
.
Can someone rebase?
Hi @rancimat, what is the status of this?
@racinmat had to hand back off to me due to work commitments. I have been busy all semester, but I will work on this during break.
Yes, in the end I had less time for this, so I did what I could, but I won't be able to finish this, sorry.