fuel
fuel copied to clipboard
False singleton dimension for scalar targets considered harmful
Consider this code:
In [1]: from fuel.datasets import MNIST
In [2]: import numpy
In [3]: first_five_targets = MNIST('train').get_data(request=range(5))[-1] # They happen to be 5, 0, 4, 1, 9
Let's consider the output of some hypothetical neural network:
In [4]: outputs = numpy.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
...: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
...: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
...: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
...: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
...: ])
In [5]: predictions = outputs.argmax(axis=-1)
Now let's calculate our accuracy. Should be 4/5 = 0.8.
In [6]: (first_five_targets == predictions).mean()
Out[6]: 0.20000000000000001
This silently broadcasted, incorrectly, to a 5x5 binary matrix. To achieve the correct result we need
In [7]: (first_five_targets.squeeze() == predictions).mean()
Out[7]: 0.80000000000000004
This is a tricky bug to spot, and these sorts of bugs could be obviated by considering scalar targets scalar and batches of scalar targets 1-dimensional.
@dwf is this issue still relevant?
I think it is, since it also makes it trickier to adapt non-Fuel code to use Fuel, as I experienced when adapting the tutorials for the deep learning summer school.
Indeed, it's kind of annoying to remember to flatten each time. I also don't totally buy the metadata argument that was the justification: that the dimension label should be "index". That is not, in fact, the label for the dimension but a label for a particular column on that dimension. What you're describing with that is more the semantics of elements of the "space" in which the items lie.
I personally like the consistency that it is a 2D array, because it makes having a single label just a special case of the multiple label case i.e. batch[i] returns an array of labels for a sample. In this case that array has length 1 (i.e. batch is a column vector), but that doesn't always have to be the case, and batch could just as well be a 2D array or a nested list.
If we're already giving up on the elements batch[i] being arrays (i.e. they could be lists), then I don't care if they're sequences or not. The tradeoff of using atleast_1d() in places where I expect an array is better than accidental silent broadcasting, as operations that explicitly require a sequence (or more specifically an array) and receive a scalar will usually raise an error, whereas operations that expect a vector and get a matrix will often not (as demonstrated by the example in the description text).