Open-Assistant
Special tokens in datasets
As can be seen in some of the answers, our model outputs quite a number of tokens that are reserved for special purposes and should not appear in plain text. We need a script that checks a dataset for these tokens and points us to the problematic samples.
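A minimal sketch of such a check, assuming a Hugging Face `datasets` dataset with a `text` column and a tokenizer whose special tokens we want to flag (the model name, dataset name, and extra token list below are placeholders):

```python
# Sketch: scan a dataset for reserved/special tokens appearing as plain text.
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model; use whatever tokenizer the training run uses.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
special_tokens = [t for t in tokenizer.all_special_tokens if t]
# Additional control markers worth flagging even if the tokenizer doesn't list them.
special_tokens += ["<|endoftext|>", "<|prompter|>", "<|assistant|>"]

# Placeholder dataset name and column.
dataset = load_dataset("some_org/some_dataset", split="train")

for idx, sample in enumerate(dataset):
    text = sample["text"]
    hits = [tok for tok in special_tokens if tok in text]
    if hits:
        print(f"sample {idx}: contains {hits}")
        print(text[:200])  # short preview of the problematic sample
```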
Investigation of this should be done manually IMO.
Agree, this would be a really useful inclusion.
I also think it might be quite beneficial to investigate all the data we use for instances of "As a large language model" and related phrases, which indicate a high probability of the reply being undesirable/useless. That may be outside the scope of this issue, though.
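For reference, a rough filter for such phrases could look like the sketch below; the phrase list is an assumption and would need tuning against real samples:

```python
import re

# Assumed phrases that often indicate a templated, unhelpful reply.
UNDESIRABLE_PATTERNS = [
    r"as a large language model",
    r"as an ai language model",
    r"i am an ai developed by",
]
pattern = re.compile("|".join(UNDESIRABLE_PATTERNS), re.IGNORECASE)

def flag_reply(text: str) -> bool:
    """Return True if the reply contains one of the undesirable phrases."""
    return bool(pattern.search(text))
```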
@olliestanley We found that our model believes it was created by OpenAI because of the statements you mention. The linked PR removes these phrases (from the 14 datasets mentioned in the PR).