sacremoses
Bug in final apostrophe!!
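This bug is inherited from the original Moses tokenizer: an apostrophe at the end of a line stays attached to the preceding number (10'), while the same apostrophe in the middle of a line is split off (8 ').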
Original Moses:
$ cat in.txt
dip dye hand-tufted ivory / navy area rug, 8' x 10'
azzura hill hand-tufted ivory indoor/outdoor area rug, 7'6" x 9'6"
caterine hand-tufted ivory area rug, 9' x 12'
de cor hand-tufted ivory/navy area rug, 8' x 10'
$ cat in.txt | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en
dip dye hand-tufted ivory / navy area rug , 8 ' x 10'
azzura hill hand-tufted ivory indoor / outdoor area rug , 7 ' 6 " x 9 ' 6 "
caterine hand-tufted ivory area rug , 9 ' x 12'
de cor hand-tufted ivory / navy area rug , 8 ' x 10'
SacreMoses output:
>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()
>>> x = """dip dye hand-tufted ivory / navy area rug, 8' x 10'
... azzura hill hand-tufted ivory indoor/outdoor area rug, 7'6" x 9'6"
... caterine hand-tufted ivory area rug, 9' x 12'
... de cor hand-tufted ivory/navy area rug, 8' x 10'
... """
>>> for sent in x.split('\n'):
...     print(mt.tokenize(sent.strip(), return_str=True))
...
dip dye hand-tufted ivory / navy area rug , 8 ' x 10'
azzura hill hand-tufted ivory indoor / outdoor area rug , 7 ' 6 " x 9 ' 6 "
caterine hand-tufted ivory area rug , 9 ' x 12'
de cor hand-tufted ivory / navy area rug , 8 ' x 10'
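
Until the tokenizer itself is fixed, one possible stopgap is to post-process the tokenized string. The sketch below assumes the intended behavior is for a line-final apostrophe after a digit to be split off like the mid-line ones. The helper name split_final_apostrophe is hypothetical and not part of sacremoses or Moses. It also covers the escaped form (&apos;), in case the tokenizer is run with escaping enabled.

```python
import re

# Hypothetical helper, not part of sacremoses: matches an apostrophe
# (raw or escaped as &apos;) that directly follows a digit at the end of the line.
_FINAL_APOS = re.compile(r"(?<=\d)('|&apos;)$")


def split_final_apostrophe(tokenized: str) -> str:
    """Insert a space before a line-final apostrophe that is attached to a digit."""
    return _FINAL_APOS.sub(r" \1", tokenized)


if __name__ == "__main__":
    line = "dip dye hand-tufted ivory / navy area rug , 8 ' x 10'"
    print(split_final_apostrophe(line))
```

Lines that do not end in a digit followed by an apostrophe, such as the 7'6" x 9'6" example, pass through unchanged.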