castor
Fix tokenizer for reuters dataset
Need to remove a few characters (like ?, !) from sentences. In other words, add a few relevant delimiters.
Take a look at datasets/reuters.py. Removing the special characters from the regular expression should do what you want.
import re

def clean_string(string):
    """
    Performs tokenization and string cleaning for the Reuters dataset
    """
    # Replace everything outside this character class with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'`]", " ", string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()
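For example, omitting the punctuation characters from the character class means they also get replaced with spaces and thus disappear during the split. A minimal sketch of the modified function (the sample input is illustrative, not from the Reuters data):

```python
import re

def clean_string(string):
    """
    Tokenization and cleaning with punctuation removed: dropping
    (),!?'` from the kept character class replaces those characters
    with spaces, so they act as delimiters rather than tokens.
    """
    string = re.sub(r"[^A-Za-z0-9]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()

print(clean_string("Hello, world! Is this OK?"))
# ['hello', 'world', 'is', 'this', 'ok']
```

Note that this also strips apostrophes, so contractions like "don't" become two tokens ("don" and "t"); keep `\'` in the class if that matters for your task.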