
Fix tokenizer for reuters dataset

Open: Ashutosh-Adhikari opened this issue 6 years ago • 1 comment

The tokenizer needs to strip a few characters (such as ? and !) from sentences. In other words, these characters should be treated as delimiters.

Ashutosh-Adhikari, Oct 19 '18 05:10

Take a look at clean_string in datasets/reuters.py. The first regular expression replaces every character that is not in its character class with a space, so removing ? and ! from that class should do what you want.

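import re


def clean_string(string):
    """
    Performs tokenization and string cleaning for the Reuters dataset
    """
    # Replace every character outside the allowed set with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'`]", " ", string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()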
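
For illustration, here is a minimal sketch of that change. It assumes only ? and ! need to go. The name clean_string_strict is hypothetical and is not part of the repository. The pattern is the one above with those two characters dropped from the character class.

```python
import re


def clean_string_strict(string):
    """
    Hypothetical variant of clean_string that also treats ? and ! as delimiters
    """
    # ! and ? are no longer in the allowed set, so each becomes a space
    string = re.sub(r"[^A-Za-z0-9(),'`]", " ", string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r"\s{2,}", " ", string)
    return string.lower().strip().split()


print(clean_string_strict("Oil prices rose! Why?"))
# ['oil', 'prices', 'rose', 'why']
```

Parentheses, commas, apostrophes and backticks are still kept, so they stay attached to neighbouring tokens. Other characters can be dropped from the class in the same way if they should act as delimiters too.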

achyudh, Oct 20 '18 08:10