A problem when the sample_size is not divisible by the batch_size
I adapted DataGenerator to my Deep Learning pipeline. When the sample size is not divisible by the batch_size, the DataGenerator seems to wrap around to the first batch instead of yielding the last (smaller) batch.
Example: Let A be an array of training samples, with batch_size = 4 and A = [4,7,8,7,9,78,8,4,78,51,6,5,1,0]. Here A.size = 14, which is clearly not divisible by batch_size.
The batches the DataGenerator yields during the training process are the following:
- Batch_0 = [4,7,8,7]
- Batch_1 = [9,78,8,4]
- Batch_2 = [78,51,6,5]
- Batch_3 = [4,7,8,7]

This is where the problem lies: instead of Batch_3 = [1,0], it goes back to the first batch.
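For reference, a minimal standalone sketch of the floor arithmetic at play (plain Python, independent of the generator; the reading of the repeated batch as the start of the next epoch is an assumption):

import math

A = [4, 7, 8, 7, 9, 78, 8, 4, 78, 51, 6, 5, 1, 0]
batch_size = 4

# A floor-based __len__ reports only 3 batches per epoch
n_batches = math.floor(len(A) / batch_size)  # floor(14 / 4) = 3

for index in range(n_batches):
    print(f"Batch_{index} =", A[index * batch_size:(index + 1) * batch_size])
# Batch_0 = [4, 7, 8, 7]
# Batch_1 = [9, 78, 8, 4]
# Batch_2 = [78, 51, 6, 5]
# [1, 0] is never requested; the next batch served is the first batch
# of the following epoch, which can look like wrapping around.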
Here is a situation where another generator behaves well when the sample_size is not divisible by the batch_size: https://stackoverflow.com/questions/54159034/what-if-the-sample-size-is-not-divisible-by-batch-size-in-keras-model
For your information, I kept the following instruction as is:
int(np.floor(len(self.list_IDs) / self.batch_size))
If I change np.floor to np.ceil, it breaks during the training/validation phases.
As far as I know, the problem lies in __data_generation: it creates X and y of size self.batch_size. This isn't a problem when you use np.floor, because then all batches are the same size, but you lose the last batch when the size of your data is not divisible by the batch_size. If it is not divisible, X and y get "filled" with as much data as the last batch provides, but the rest stays empty. This empty remainder is the problem, because the model can't fit this uninitialized (arbitrary) data. To prevent this from happening when using np.ceil, use e.g.
true_size = len(list_IDs_temp)
X = np.empty((true_size, *self.dim, self.n_channels))
y = np.empty((true_size), dtype=int)
in https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L42-L46 to always get an np.empty array of the matching size.
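For context, here is a sketch of how the whole method could look with that change applied (the attribute names and the per-sample loading are taken from the linked my_classes.py):

def __data_generation(self, list_IDs_temp):
    'Generates data containing up to batch_size samples'
    # Size X and y to the actual batch so a smaller final batch
    # leaves no uninitialized rows behind
    true_size = len(list_IDs_temp)
    X = np.empty((true_size, *self.dim, self.n_channels))
    y = np.empty((true_size,), dtype=int)

    # Fill every row; nothing is left over from np.empty
    for i, ID in enumerate(list_IDs_temp):
        X[i,] = np.load('data/' + ID + '.npy')
        y[i] = self.labels[ID]

    return X, keras.utils.to_categorical(y, num_classes=self.n_classes)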
FWIW, I think you have 2 options:
- use math.floor(len(...) / batch_size) — this keeps all batch sizes the same, but you skip over the data past the last full-size batch (if you shuffle the indices between every epoch, you'll probably cover them eventually)
- use math.ceil(len(...) / batch_size) — if so, you want the last batch to be smaller, rather than index past the end of the array and wrap around to the beginning of the array (or cause out-of-bounds errors)
If you use approach (2), then you have to cap the upper bound of the batch, i.e., instead of:
https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L25-L26
which can be rewritten as:
low = index * self.batch_size
high = (index+1) * self.batch_size
indexes = self.indexes[low:high]
you may want to use something like:
low = index * self.batch_size
high = min((index + 1) * self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]
to cap at the length of the array; the last batch may be smaller if the total number of items is not a multiple of batch size.
And if you want to micro-optimize, you can replace a multiplication with an addition:
low = index * self.batch_size
high = min(low + self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]
I wrote some helper classes (with tests) which can be dropped in to provide the data slicing, or you can incorporate them into your Sequence subclasses directly.
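Putting option (2) together, the two methods might look as follows (a sketch only, assuming the attribute names from the linked my_classes.py, an import math at the top of the module, and the resized __data_generation shown earlier):

def __len__(self):
    'Number of batches per epoch; the last batch may be smaller'
    return math.ceil(len(self.list_IDs) / self.batch_size)

def __getitem__(self, index):
    'Generate one batch of data, capped at the end of the dataset'
    low = index * self.batch_size
    high = min(low + self.batch_size, len(self.list_IDs))
    indexes = self.indexes[low:high]
    list_IDs_temp = [self.list_IDs[k] for k in indexes]
    X, y = self.__data_generation(list_IDs_temp)
    return X, y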
@mbrukman Thank you for the nice indexing suggestion! I have two questions:
- Is the calculation of high really necessary, since slicing over the end of an array/list like

a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]?
- Aren't you still creating a too-big X and y in https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L45-L46 when your last batch is smaller than self.batch_size, since high and low don't change the batch size? So when your last batch is smaller than self.batch_size, X and y will still be filled with (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty
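(To illustrate the pitfall: np.empty allocates without initializing, so a small sketch like the following typically prints arbitrary values in the rows a smaller batch leaves unfilled:)

import numpy as np

X = np.empty((4, 2))   # uninitialized allocation
X[:2] = 1.0            # a smaller "last batch" fills only the first 2 rows
print(X[2:])           # the remaining rows hold arbitrary leftover values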
@gwirn wrote:
- Is the calculation of high really necessary, since slicing over the end of an array/list like

a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]?
I think it returns [3, 4], but your point is valid: slicing past the end of a native list in Python or a NumPy array is valid. However, I'm used to computing my array bounds precisely because I use a variety of languages, in some of which indexing outside of array bounds is invalid (and either throws an exception, crashes, or causes undefined behavior), so I aim to be precise everywhere, for consistency (and peace of mind).
Aren't you still creating a too-big X and y in https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L45-L46 when your last batch is smaller than self.batch_size, since high and low don't change the batch size? So when your last batch is smaller than self.batch_size, X and y will still be filled with (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty
Just to clarify, I am not creating a too-big X or y, because this is not my repo and it's not my code. But you're correct: if we're going to make the last batch smaller than the rest, then we also have to fix the initialization to create arrays of the correct size.
My implementation of Sequence creates a smaller last batch and doesn't initialize X and y to be of size batch_size; please see the code references I linked to in my earlier response.
I think some of the sample code here and in the blog post needs to be updated to handle these cases; I opened a separate issue https://github.com/afshinea/keras-data-generator/issues/7 to ask about the license for this repo (since there isn't one now) so that we can contribute some fixes for the code.
@mbrukman Yes, you are right: [3, 4], not [2, 3]. That's a typo.
Thank you for your detailed answer!