Open-Assistant
Open-Assistant copied to clipboard
Add dialogue data collator tests
Add dialogue data collator unit test. Things to note on this PR:
- is it correct that we mask the last occurance of
<|endoftext|>
of the assistant? See the example in the test, there will be one occurance where we have this token and one where there is none. See the todo in the code. - I built a dummy tokenizer from the pythia-70m one using
tokenizer = old_tokenizer.train_new_from_iterator(training_iter, vocab_size)
to keep the size minimal. Just trained on the text that appears in the test.
:x: pre-commit failed.
Please run pre-commit run --all-files
locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md