hedwig
hedwig copied to clipboard
Use NLTK sent_tokenize and word_tokenize
We should replace our primitive regex based tokenization with NLTK's tokenize module in the dataset pre-processing classes (after creating a snapshot release of this repository for the camera-ready)
Code duplication can be reduced if the pre-processing methods are moved to a util module rather than having it in each dataset class.