DanielHesslow
I'm not particularly familiar with the Hugging Face code base, and I don't currently have the time to read up on the specifics. The format used during training is: ```...
`tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")` is indeed the correct tokenizer; the vocab size is 26.
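A minimal sketch of loading that tokenizer and checking the stated vocab size (assumes the `transformers` library is installed and the Hugging Face Hub is reachable):

```python
# Sketch: load the RITA_s tokenizer and inspect its vocabulary size.
# Assumes `transformers` is installed and the Hub is reachable.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

# Per the thread, the RITA vocabulary has 26 tokens
# (the amino-acid alphabet plus special tokens).
print(tokenizer.vocab_size)
```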
This remapping is unfortunately not correct for all tokenizers, and there isn't actually a single universal mapping. Doing it correctly requires treating each internal decoder separately. It's very possible, but it...