conversational-datasets Removing samples containing profanity

Open: vmurahari3 opened this issue 5 years ago • 1 comment

Do you think it makes sense to remove samples containing profanity?

Jun 29 '19 19:06 vmurahari3

In general, there is a lot of questionable language in the Reddit dataset, as it is totally unfiltered and we include all subreddits, even 'nsfw' ones. It is still natural language and a potentially useful learning signal, though of course we need to be careful about how the resulting model is used.

We could perhaps add flags to the pipeline for filtering based on the nsfw label etc. These would be off by default.

Jul 01 '19 01:07 matthen
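
For illustration only, here is a minimal sketch of what such opt-in flags might look like. Nothing below is taken from the repository: the flag names `--filter_nsfw` and `--filter_profanity`, the `over_18` and `body` sample fields, and the tiny placeholder word list are all assumptions made for the example. Both filters are off unless the flag is passed.

```python
import argparse
import re

# Placeholder word list (an assumption for this sketch); a real list would be
# far longer and probably loaded from a file.
DEFAULT_BLOCKLIST = frozenset({"darn", "heck"})

_WORD_RE = re.compile(r"[a-z']+")


def build_parser():
    """Builds a parser with two hypothetical filtering flags, off by default."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--filter_nsfw", action="store_true",
        help="Drop samples whose (assumed) 'over_18' field is true.")
    parser.add_argument(
        "--filter_profanity", action="store_true",
        help="Drop samples containing a word from the blocklist.")
    return parser


def keep_sample(sample, args, blocklist=DEFAULT_BLOCKLIST):
    """Returns True if the sample passes every filter that is enabled."""
    if args.filter_nsfw and sample.get("over_18", False):
        return False
    if args.filter_profanity:
        # Whole-word match on lowercased text, so substrings are not flagged.
        words = set(_WORD_RE.findall(sample.get("body", "").lower()))
        if words & blocklist:
            return False
    return True


def filter_samples(samples, args):
    """Applies keep_sample to a list of dict samples."""
    return [s for s in samples if keep_sample(s, args)]
```

With no flags set, every sample is kept, which matches the "off by default" suggestion. The `keep_sample` predicate is written as a plain function so that it could, in principle, be handed to whatever filtering step the pipeline uses.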
