paraphrase-id-tensorflow
Refactor out unnecessary processing in data pipeline
Right now, the data pipeline tokenizes the input into both words and characters, even if you only want words. This is fine for now since character tokenization isn't that expensive, but it's not ideal once we want to use NER/POS features: running the taggers can be quite slow, and we don't want to do it unless necessary.
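One possible direction, as a rough sketch rather than a description of the current code: have each instance compute a representation only when it is first requested, and cache the result. Everything named here (`LazyInstance`, `PROCESSORS`, `slow_pos_tag`, the `calls` list) is hypothetical and made up for illustration; the tagger is a placeholder, not a real NER/POS call.

```python
from typing import Callable, Dict, List

# Records which processors actually ran, so the laziness is observable.
calls: List[str] = []


def word_tokenize(text: str) -> List[str]:
    calls.append("words")
    return text.split()


def char_tokenize(text: str) -> List[str]:
    calls.append("characters")
    return list(text)


def slow_pos_tag(text: str) -> List[str]:
    # Placeholder standing in for an expensive tagger (assumption, not a real API).
    calls.append("pos")
    return ["TAG"] * len(text.split())


# Hypothetical registry mapping a feature name to the function that computes it.
PROCESSORS: Dict[str, Callable[[str], List[str]]] = {
    "words": word_tokenize,
    "characters": char_tokenize,
    "pos": slow_pos_tag,
}


class LazyInstance:
    """Hypothetical instance that computes each representation on first request."""

    def __init__(self, text: str,
                 processors: Dict[str, Callable[[str], List[str]]]) -> None:
        self.text = text
        self._processors = processors
        self._cache: Dict[str, List[str]] = {}

    def get(self, mode: str) -> List[str]:
        # Run the processor only if this mode has not been computed yet.
        if mode not in self._cache:
            self._cache[mode] = self._processors[mode](self.text)
        return self._cache[mode]


instance = LazyInstance("is this a paraphrase", PROCESSORS)
words = instance.get("words")  # only the word tokenizer runs here
```

With something like this, a words-only model never pays for character tokenization or tagging, and asking for the same mode twice does not redo the work.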