Question regarding error metrics/dataset creation
I had a few questions/clarifications regarding the HDF5 dataset that was linked in the notebook:
- I ran the notebook to train from scratch using the existing HDF5 file and obtained a CER of ~0.09 with just a single model (not an ensemble).
- When creating the HDF5 file from scratch and running the training procedure, my CER is instead similar to the best/second-best models (~0.16-0.18). (For how I'm computing CER, see the sketch after this list.)
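For reference, this is how I'm computing CER: the character-level Levenshtein edit distance between prediction and reference, normalized by the reference length. A minimal self-contained sketch; I'm assuming this matches the metric used in the notebook:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """CER = edit_distance(prediction, reference) / len(reference)."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("hallo world", "hello world"))  # 1 edit / 11 chars ~= 0.09
```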
So, as far as I can see, the main difference would be in the dataset generation/preprocessing steps or in the tokenizer:
a. In the notebook there's a comment that the pretrained models used a vocab size of 100 as opposed to 99 (95 characters + SOS/EOS/PAD/UNK tokens). Is there an additional token used here? (See the sketch after these questions for how I'm counting to 99.)
b. Was the generation procedure for the HDF5 file that was linked (on Google Drive) slightly different?
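To make question (a) concrete, here is a minimal sketch of how I'm constructing the 99-token vocabulary: the 95 printable ASCII characters plus the four special tokens. The token names and their ordering here are my own assumptions, not necessarily what the pretrained models used:

```python
import string

# Hypothetical vocabulary construction for question (a); the special-token
# names and their placement at the front are assumptions on my part.
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# string.printable has 100 characters; dropping the 5 whitespace controls
# (\t \n \r \x0b \x0c) but keeping the space leaves the 95 printable chars.
CHARS = [c for c in string.printable if c not in "\t\n\r\x0b\x0c"]

vocab = SPECIAL_TOKENS + CHARS
print(len(CHARS))  # 95
print(len(vocab))  # 99 -- one short of the vocab size of 100 in the comment
```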
Thank you!
I also don't remember the details of that first iteration exactly, but I am working on a paper covering the different preprocessing experiments, which should help the community. I will update you once it is finalized.