Scores computed with np.average are wrong when the input batch is a dictionary. The reason is that len is called on the batch to get the averaging weight, and for a dict len returns the number of keys rather than the number of samples.
See lines here and here.
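
A minimal sketch of the problem, assuming per-batch scores are combined with np.average using len(batch) as the weight. The batch contents, the key names, and the batch_size helper are hypothetical and only illustrate one possible fix; they are not taken from the library code.

```python
import numpy as np


def batch_size(batch):
    """Hypothetical helper: number of samples in a batch, not number of keys."""
    if isinstance(batch, dict):
        # Assumes every value shares the same leading (sample) dimension.
        return len(next(iter(batch.values())))
    return len(batch)


# Two dict batches with the same two keys but different sample counts (8 and 2).
batches = [
    {"input_ids": np.zeros((8, 4)), "mask": np.ones((8, 4))},
    {"input_ids": np.zeros((2, 4)), "mask": np.ones((2, 4))},
]
batch_scores = [1.0, 0.0]  # made-up per-batch scores

# len() on a dict counts keys, so both batches get weight 2.
naive_weights = [len(b) for b in batches]
# Counting samples gives the intended weights 8 and 2.
fixed_weights = [batch_size(b) for b in batches]

naive = np.average(batch_scores, weights=naive_weights)
fixed = np.average(batch_scores, weights=fixed_weights)

print(naive_weights, naive)  # [2, 2] 0.5
print(fixed_weights, fixed)  # [8, 2] 0.8
```

With key counts as weights, every dict batch contributes equally, so the result is only correct when all batches happen to have the same number of samples.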