fsdl-text-recognizer-2022
[bug]: np.bincount prepends zeros in data/emnist.py
I checked this issue has not been duplicated.
Hi @charlesfrye, I'm not sure if this repo is accepting PRs, but I spotted a bug in the data/emnist.py file. It concerns the _sample_to_balance function and its use of np.bincount here.
Because you offset the labels by NUM_SPECIAL_TOKENS here and here before calling the subsampling function, np.bincount returns zero counts for the unused label values from 0 to y_min_element-1 inclusive. Those leading zeros are included in the mean of the counts and bias it towards zero, which could lead to a smaller dataset than intended.
Example behaviour of np.bincount:
>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
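To make the effect on the mean concrete, here is a minimal sketch using the same toy labels as above. It is only an illustration and not necessarily what the linked PR does; the two workarounds shown (slicing off the bins below the smallest label, or counting with np.unique) are my own assumptions about how the empty bins could be excluded.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same value as in the example above

# Toy labels, offset the same way as in the example above
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# np.bincount counts every integer from 0 to y.max(), so bins 0..3 are empty
counts = np.bincount(y)
biased_mean = counts.mean()  # 5 / 7, pulled down by the four empty bins

# Possible workaround A (assumption): drop the bins below the smallest label
trimmed = counts[y.min():]
unbiased_mean = trimmed.mean()  # 5 / 3

# Possible workaround B (assumption): count only the labels that actually occur
_, present_counts = np.unique(y, return_counts=True)

print(biased_mean, unbiased_mean, present_counts)
```

The two workarounds agree here, but they differ if a label in the middle of the range is missing: slicing keeps that empty bin, while np.unique skips it.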
I have proposed a solution to the described bug in this PR.