MLDataPattern.jl
Make DataIterators return the data instead of lazy subsets
I think the current behaviour is a design flaw that is an artefact of the early days.
Right now, when you iterate over a RandomObs or a BalancedObs iterator, it actually returns a data subset (a SubArray, or a DataSubset). This is a bit silly, since it requires the user to call getobs on the returned value to get the actual data.
for x in RandomObs(X)
    actual_x = getobs(x)
    # ...
end
The thing is that when X is a matrix this issue isn't really visible, because x would just be a SubArray vector, which often works just as well as a Vector. However, if X itself is a Vector, then suddenly x is a 0-dim SubArray, and one absolutely needs getobs to use that data as intended.
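The 0-dim case is easy to reproduce with plain Base views, independent of MLDataPattern, since that is essentially what a lazy per-observation subset boils down to:

```julia
# Per-observation subsetting of a matrix yields a SubArray vector,
# which mostly behaves like a Vector:
X = rand(4, 5)
x_col = view(X, :, 3)   # 4-element SubArray

# But per-observation subsetting of a Vector yields a 0-dim SubArray:
v = collect(1.0:5.0)
x = view(v, 3)
ndims(x)                # 0: neither a scalar nor a vector
x[]                     # an extra indexing step is needed to get 3.0
```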
I think there is no reason for x to not be the actual data at this point. There is not going to be any more subsetting once we are on a per-observation level.
Note that changing this would be breaking.
Does this limit the possibilities for lazy (out-of-memory) iterators?
(E.g. for when the data won't fit into RAM all at once anyway)
Or has that boat already sailed, and one should be using some other type/trick (e.g. mmap) to handle those anyway?
No, it just means that the selected observation would then be loaded directly by Base.next.
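A minimal sketch of what that could look like, written against the modern Base.iterate protocol (the thread predates it and refers to Base.next). The type EagerRandomObs and the helper getobs_ are illustrative names only, not part of the MLDataPattern API:

```julia
# Hypothetical sketch: the iterator materializes each observation eagerly,
# so the caller never has to call getobs on the yielded value.
struct EagerRandomObs{T}
    data::T
    count::Int
end

getobs_(A::AbstractVector, i) = A[i]      # copy the i-th observation
getobs_(A::AbstractMatrix, i) = A[:, i]   # observations are columns

function Base.iterate(it::EagerRandomObs, state = 1)
    state > it.count && return nothing
    # pick a random observation index along the observation dimension
    i = rand(1:size(it.data, ndims(it.data)))
    # load/copy the selected observation directly in the iteration step
    return getobs_(it.data, i), state + 1
end

Base.length(it::EagerRandomObs) = it.count
```

With this, `collect(EagerRandomObs(rand(5), 3))` yields plain Float64 values rather than 0-dim views.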
Sounds reasonable to me then.
I suppose this would remove the possibility of in-place modifications on the container, as the iterator would yield values rather than references. So, just to exemplify, the current behaviour allows the following:
julia> using MLDataPattern

julia> v = rand(5);

julia> vr = MLDataPattern.RandomObs(v, 5);

julia> for vi in vr
           vi[1] = 0.0
       end

julia> v
5-element Array{Float64,1}:
 0.0
 0.788368
 0.0
 0.597307
 0.423634
That would be true, yes.
I have to say I like the current behaviour better, for its consistency (i.e. when iterating, yield subsets, call getobs to make a copy).
I see. My main pet peeve with the current behaviour is that it has strange consequences when a DataIterator is not just a decorator around a data container (i.e. when there is no such thing as a subset and the data is only available as a stream). One way to deal with that would be for Base.next to return something similar to a "promise" in such a case.
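A minimal sketch of the "promise" idea, under the assumption that the stream-backed iterator yields a thunk and forcing it loads the observation. The type ObsPromise is a hypothetical name, not part of MLDataPattern:

```julia
# Hypothetical sketch: a stream-backed iterator cannot hand out a lazy
# subset, so it could yield a promise instead; getobs forces the load.
struct ObsPromise{F<:Function}
    load::F
end

getobs(p::ObsPromise) = p.load()

# An iterator over a line-oriented stream could then yield something like
# ObsPromise(() -> parse_line(readline(io))) on each iteration
# (parse_line and io are placeholders here).
```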
This type of iteration over streams seems powerful, but the concept of an observation becomes somewhat ambiguous, as does dimensionality. Is there a specific use case you are referring to?
What is the observation ambiguity you are referring to?
To illustrate with an example: the Iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) has 150 observations. Each observation is about a different flower; that is to say, at one point there was a stack containing 150 flowers.
Each observation has some features. In this case the features are: Sepal length, Sepal width, Petal length, Petal width, and Species. The Species is often used as the target (or output) feature, and since this would be classification it is called the label.
The MNIST dataset http://yann.lecun.com/exdb/mnist/ contains 60,000 observations in the training set and 10,000 observations in the test set. In this case each observation is a different example of a handwritten digit. The features of each observation are its image, represented as a 28x28 matrix of grey-scale pixel intensities (or flattened into a 784-element vector), and which digit it is (0, 1, 2, ... or 9). The digit is the label for classification.
MS-COCO https://arxiv.org/abs/1405.0312 is a dataset of 2.5 million captioned images. There are only 328 thousand unique images, but they have multiple captions each.
That means for most processing there are 2.5 million observations. The features of an observation are a textual caption and an image (I think 3 matrices of pixel intensities, one for each color channel). If processing with e.g. an LSTM, one might then want to break the string into tokens, in which case there is a variable number of features per observation. (One could then process that string using MLDataUtils, and in that case each word would be an observation.)
I based the ambiguity statement on the assumption that the data is not in any collection-like form, but a stream/buffer (where calling nobs may or may not make sense, e.g. if the stream is continuously modified). Random observations can be generated provided some 'rule' (e.g. number of bytes per observation) for parsing the stream; that 'rule' is the ambiguous part for me, since it pertains to a container which is apparently missing. I am not sure what @Evizero is exactly hinting at and may have misunderstood the whole thing :) ...
A data iterator, in my definition, is just something that provides you with data each iteration (http://mldatapatternjl.readthedocs.io/en/latest/introduction/motivation.html#two-kinds-of-data-sources). Whether that is a random element with respect to the hidden source is open for the data-iterator subtype to dictate.
For example a RandomObs is a data iterator that only really works as promised because it is a decorator around a data container (which knows N etc). In general, however, data iterators need not be decorators around containers. It could be some sensor for all the interface cares. We just don't have a concrete example like that implemented yet.
Indeed, I see the peeve. The concept of randomness does require knowledge of some temporary N (the data available at some moment), or something along those lines in another dimension. Would it make sense to require the definition of a specific random(data::IO) function that the iterator could use instead of datasubset for iteration-based data sources? (This would leave open the option of implementing some kind of dynamic subset based on the IO object size, if feasible.)
I'd say that introducing randomness to "proper" iterators is a different problem. @oxinabox and I already spoke about reservoir sampling for such cases. I don't think solving that needs additional interface requirements.
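For reference, reservoir sampling (Algorithm R) draws k observations uniformly at random from an iterator of unknown length in a single pass with O(k) memory, which is why no additional interface would be needed. A minimal sketch (reservoir_sample is a hypothetical name, not an MLDataPattern function):

```julia
# Reservoir sampling (Algorithm R): one pass, O(k) memory, uniform over
# an iterator whose length need not be known in advance.
function reservoir_sample(itr, k::Integer)
    reservoir = Any[]
    for (i, x) in enumerate(itr)
        if i <= k
            push!(reservoir, x)      # fill the reservoir first
        elseif (j = rand(1:i)) <= k
            reservoir[j] = x         # keep x with probability k/i
        end
    end
    return reservoir
end
```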
In general, however, data iterators need not be decorators around containers. It could be some sensor for all the interface cares.
The main design issue here is what such a sensor data iterator should return on Base.next, since it can't provide a lazy "subset" such as our RandomObs. If we stick with the current behaviour of RandomObs and co, then it should be something lazy.