
[fix]: fixes balanced subsampling bug in data/emnist.py

Open mariovas3 opened this issue 11 months ago • 0 comments

Account for the y labels being offset by NUM_SPECIAL_TOKENS when calling np.bincount in the EMNIST balanced subsampling.

The offsetting is found here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L104

and here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L106

np.bincount counts every integer from 0 up to the largest label, so after the offset it returns zeros for the leading bins 0 through y_min_element-1, which contain no samples. If these empty bins are not excluded, they pull the mean count per class down, and the balanced dataset ends up with fewer samples.

Example of the bug:

>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
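
One possible way to control for this, shown only as a minimal sketch: drop the first NUM_SPECIAL_TOKENS bins before taking the mean. The helper name `mean_count_per_class` is hypothetical and is not taken from data/emnist.py, and the value 4 is reused from the example above.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # assumed value, same as in the example above


def mean_count_per_class(y, offset=0):
    """Mean samples per class, skipping the first `offset` bins.

    Hypothetical helper for illustration; not the repository's code.
    """
    counts = np.bincount(y)
    # Bins 0..offset-1 belong to special tokens and never occur in y.
    return counts[offset:].mean()


# Labels after the offset has been applied.
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# Empty leading bins are included in the mean: 5 samples / 7 bins.
biased = np.bincount(y).mean()

# Empty leading bins are skipped: 5 samples / 3 classes.
fixed = mean_count_per_class(y, offset=NUM_SPECIAL_TOKENS)

print(biased, fixed)
```

Subtracting the offset before counting, as in np.bincount(y - NUM_SPECIAL_TOKENS), should give the same per-class counts as long as every label is at least NUM_SPECIAL_TOKENS.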

mariovas3, Mar 20 '24 18:03