hedwig

Use NLTK sent_tokenize and word_tokenize

Open achyudh opened this issue 5 years ago • 0 comments

We should replace our primitive regex-based tokenization with NLTK's tokenize module (`sent_tokenize` and `word_tokenize`) in the dataset pre-processing classes. This should happen after we create a snapshot release of this repository for the camera-ready.

We can also reduce code duplication by moving the pre-processing methods into a shared util module instead of repeating them in each dataset class.
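A minimal sketch of what such a shared util module could look like. The function names (`split_sentences`, `tokenize_flat`) and the injectable tokenizer parameters are hypothetical and not part of the current codebase; only `sent_tokenize` and `word_tokenize` from `nltk.tokenize` are real NLTK calls. NLTK is imported lazily, and it needs the Punkt models to be downloaded first.

```python
"""Hypothetical shared pre-processing helpers for the dataset classes."""
from typing import Callable, List, Optional

SentTokenizer = Callable[[str], List[str]]
WordTokenizer = Callable[[str], List[str]]


def _nltk_tokenizers():
    # Imported lazily so this module loads even where NLTK is not installed.
    # Assumes the Punkt models are available, e.g. via nltk.download("punkt")
    # (newer NLTK releases may ask for "punkt_tab" instead).
    from nltk.tokenize import sent_tokenize, word_tokenize
    return sent_tokenize, word_tokenize


def split_sentences(
    text: str,
    sent_tokenizer: Optional[SentTokenizer] = None,
    word_tokenizer: Optional[WordTokenizer] = None,
) -> List[List[str]]:
    """Return a list of sentences, each given as a list of word tokens."""
    if sent_tokenizer is None or word_tokenizer is None:
        nltk_sent, nltk_word = _nltk_tokenizers()
        sent_tokenizer = sent_tokenizer or nltk_sent
        word_tokenizer = word_tokenizer or nltk_word
    return [word_tokenizer(sentence) for sentence in sent_tokenizer(text)]


def tokenize_flat(
    text: str,
    sent_tokenizer: Optional[SentTokenizer] = None,
    word_tokenizer: Optional[WordTokenizer] = None,
) -> List[str]:
    """Return all word tokens of the text as one flat list."""
    sentences = split_sentences(text, sent_tokenizer, word_tokenizer)
    return [token for sentence in sentences for token in sentence]
```

Each dataset class could then call `split_sentences(text)` for hierarchical inputs or `tokenize_flat(text)` for flat ones, instead of carrying its own regex. Passing the tokenizers as arguments is just one design option; it keeps the helpers easy to test without the NLTK data installed.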

Mar 22 '19 17:03 achyudh
