Open-Assistant icon indicating copy to clipboard operation
Open-Assistant copied to clipboard

Special tokens in datasets

Open CloseChoice opened this issue 2 years ago • 2 comments

As it can be seen in some of the answers our model outputs quite a number of tokens that are reserved for special purposes and should not appear in text. We need a script with which we can check a dataset for these and point us to problematic samples.

Investigation of this should be done manually IMO.

CloseChoice avatar Apr 08 '23 06:04 CloseChoice

Agree this would be a really useful inclusion

I also think it might be quite beneficial to investigate all data we use for instances of "As a large language model" and related phrases which indicate a high probability of the reply being undesirable/useless, but maybe outside the scope of this issue

olliestanley avatar Apr 08 '23 08:04 olliestanley

@olliestanley We encountered that our model believes it is created by openai because of the statements you mentioned. The linked PR removes this (from the 14 datasets mentioned in the PR).

CloseChoice avatar Apr 10 '23 15:04 CloseChoice