
Add dialogue data collator tests

Open • CloseChoice opened this issue 1 year ago • 1 comment

Adds a unit test for the dialogue data collator. Things to note about this PR:

  • Is it correct that we mask the last occurrence of <|endoftext|> of the assistant? See the example in the test: there is one occurrence where this token is present and one where it is absent. See the TODO in the code.
  • I built a dummy tokenizer from the pythia-70m one using tokenizer = old_tokenizer.train_new_from_iterator(training_iter, vocab_size) to keep its size minimal. It is trained only on the text that appears in the test.
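
To make the first question concrete, here is a minimal sketch of the two masking behaviours under discussion. It is not the collator's actual code: the function name flatten_with_label_mask, the exclude_final_eos flag, the role names, and the use of string tokens instead of token ids are all assumptions made for illustration. A mask value of True is assumed to mean that the token contributes to the loss.

```python
from typing import List, Tuple

EOS = "<|endoftext|>"


def flatten_with_label_mask(
    turns: List[Tuple[str, List[str]]],
    exclude_final_eos: bool,
) -> Tuple[List[str], List[bool]]:
    """Flatten dialogue turns and mark which tokens count towards the loss.

    Hypothetical helper: every turn is terminated with EOS, and only
    assistant tokens (including their EOS) are marked True.
    """
    tokens: List[str] = []
    label_mask: List[bool] = []
    for role, turn_tokens in turns:
        for tok in turn_tokens + [EOS]:
            tokens.append(tok)
            label_mask.append(role == "assistant")
    # Optionally drop the very last EOS from the loss when the dialogue
    # ends with an assistant turn (the behaviour questioned in this PR).
    if exclude_final_eos and turns and turns[-1][0] == "assistant":
        label_mask[-1] = False
    return tokens, label_mask


turns = [("prompter", ["Hi"]), ("assistant", ["Hello", "there"])]
tokens, keep_all = flatten_with_label_mask(turns, exclude_final_eos=False)
_, drop_last = flatten_with_label_mask(turns, exclude_final_eos=True)
```

With exclude_final_eos=True the model is never trained to emit the final <|endoftext|> of the last assistant turn, which is why the expected behaviour is worth pinning down in the test.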

CloseChoice avatar Apr 12 '23 17:04 CloseChoice

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Apr 12 '23 18:04 github-actions[bot]