webstruct
Tokenizer fixes and span_tokenize method
The tokenizer from #15 had issues, for example it did not split the dot at the end of a sentence into a separate token:
40006,40007c40017
< community
< .
---
> community.
41148,41149c41158
< Reserved
< .
---
> Reserved.
Now this issue should be fixed.
I've also refactored the code and added a span_tokenize method (@kmike, I remember you said it would be nice to have this method).
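For readers unfamiliar with the idea, here is a minimal sketch of what a span_tokenize-style method does: it yields (start, end) character offsets instead of token strings, so tokens can be mapped back to the original text. This is not the code from this PR. The regex and the helper names below are assumptions chosen for illustration, and the real tokenizer handles many more cases.

```python
import re

# Hypothetical pattern for illustration only: a run of word characters,
# or a single non-space, non-word symbol (so a trailing dot is its own token).
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def span_tokenize(text):
    """Yield (start, end) character offsets for each token in text."""
    for match in _TOKEN_RE.finditer(text):
        yield match.span()


def tokenize(text):
    """Return token strings by slicing the text with the spans."""
    return [text[start:end] for start, end in span_tokenize(text)]


text = "All Rights Reserved."
spans = list(span_tokenize(text))
tokens = tokenize(text)
print(spans)   # [(0, 3), (4, 10), (11, 19), (19, 20)]
print(tokens)  # ['All', 'Rights', 'Reserved', '.']
```

Deriving tokenize from span_tokenize keeps the two in sync: every token is guaranteed to be an exact slice of the input.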
Performance wasn't hurt:
X, y = webstruct.HtmlTokenizer().tokenize(trees)
CPU times: user 3.42 s, sys: 32 ms, total: 3.46 s
Wall time: 3.45 s
@chekunkov do you by chance recall why this PR wasn't merged?
@kmike nope, have no idea why.