Question regarding error metrics/dataset creation
I had a few questions/clarifications regarding the HDF5 dataset that was linked in the notebook:
- I ran the notebook to train from scratch using the existing HDF5 file and obtained a CER of ~0.09 with just a single model (not an ensemble).
- When creating the HDF5 file from scratch and running the training procedure, my CER is instead similar to the best/second-best models (~0.16-0.18). (For how I'm computing CER, see the sketch after this list.)
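For reference, this is how I'm computing CER: the character-level Levenshtein edit distance between prediction and reference, normalized by the reference length. A minimal self-contained sketch; I'm assuming this matches the metric used in the notebook:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """CER = edit_distance(prediction, reference) / len(reference)."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("hallo world", "hello world"))  # 1 edit / 11 chars ~= 0.09
```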
So, as far as I can see, the main difference would be in the dataset generation/preprocessing steps or in the tokenizer:
a. In the notebook there's a comment that the pretrained models used a vocab size of 100 as opposed to 99 (95 characters + SOS/EOS/PAD/UNK tokens). Is there an additional token used here? (See the sketch after these questions for how I'm counting to 99.)
b. Was the generation procedure for the HDF5 file that was linked (on Google Drive) slightly different?
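To make question (a) concrete, here is a minimal sketch of how I'm constructing the 99-token vocabulary: the 95 printable ASCII characters plus the four special tokens. The token names and their ordering here are my own assumptions, not necessarily what the pretrained models used:

```python
import string

# Hypothetical vocabulary construction for question (a); the special-token
# names and their placement at the front are assumptions on my part.
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# string.printable has 100 characters; dropping the 5 whitespace controls
# (\t \n \r \x0b \x0c) but keeping the space leaves the 95 printable chars.
CHARS = [c for c in string.printable if c not in "\t\n\r\x0b\x0c"]

vocab = SPECIAL_TOKENS + CHARS
print(len(CHARS))  # 95
print(len(vocab))  # 99 -- one short of the vocab size of 100 in the comment
```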
Thank you!
I also don't remember the details of that first iteration exactly, but I am working on a paper covering the different preprocessing experiments, which should help the community. I will update you once it is finalized.