fsdl-text-recognizer-2022
[fix]: fixes balanced subsampling bug in data/emnist.py
Account for y labels being offset by NUM_SPECIAL_TOKENS when calling np.bincount in the EMNIST balanced subsampling.
The offsetting is found here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L104
and here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L106
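np.bincount returns a count for every integer from 0 up to the largest label, so it prepends zeros for the values 0 to y_min_element-1 that never occur. With the labels offset by NUM_SPECIAL_TOKENS, these empty bins bias the mean of the counts downward if not controlled for, which results in fewer samples in the balanced dataset.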
Example bug:
>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
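The effect on the mean, and two ways to control for it, can be sketched as follows. This is a minimal illustration using the toy labels from the example above, not the exact change made in data/emnist.py; the variable names (biased_mean, mean_shifted, mean_sliced) are hypothetical.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same value as in the example above

# Toy labels, already offset as they would be after the special tokens are added.
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# Buggy: the four empty leading bins are included in the mean.
biased_mean = np.bincount(y).mean()  # counts are [0, 0, 0, 0, 2, 2, 1]

# Option 1 (assumed fix): remove the offset before counting.
mean_shifted = np.bincount(y - NUM_SPECIAL_TOKENS).mean()  # counts are [2, 2, 1]

# Option 2 (assumed fix): count first, then drop the special-token bins.
mean_sliced = np.bincount(y)[NUM_SPECIAL_TOKENS:].mean()

print(biased_mean, mean_shifted, mean_sliced)
```

Here the biased mean is 5/7 while both corrected variants give 5/3, so any per-class sample target derived from the mean would be noticeably smaller in the buggy case.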