Tokenizer fixes and span_tokenize method

Open chekunkov opened this issue 10 years ago • 2 comments

The tokenizer from #15 had issues, such as not splitting the dot at the end of a sentence into a separate token:

40006,40007c40017
< community
< .
---
> community.
41148,41149c41158
< Reserved
< .
---
> Reserved.

Now this issue should be fixed.

I've also refactored the code and added a span_tokenize method (@kmike, I remember you said it would be nice to have this method).
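To illustrate the idea, here is a minimal sketch of what a span-based tokenizer can look like. It is not the code from this PR: the `TOKEN_RE` pattern and both function bodies are hypothetical, and only the `span_tokenize` name and the "trailing dot becomes its own token" behaviour come from the description above.

```python
import re

# Hypothetical pattern (an assumption, not webstruct's actual rules):
# a run of word characters, or any single non-space, non-word character,
# so a sentence-final "." is emitted as its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def span_tokenize(text):
    """Return (start, end) character offsets for each token in text."""
    return [m.span() for m in TOKEN_RE.finditer(text)]


def tokenize(text):
    """Return token strings, derived from the spans."""
    return [text[start:end] for start, end in span_tokenize(text)]


print(tokenize("All Rights Reserved."))
print(span_tokenize("community."))
```

Deriving `tokenize` from `span_tokenize` keeps the two in sync, and the offsets let a caller map each token back to its position in the original text.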

Performance wasn't hurt:

X, y = webstruct.HtmlTokenizer().tokenize(trees)

CPU times: user 3.42 s, sys: 32 ms, total: 3.46 s
Wall time: 3.45 s

chekunkov, Jun 07 '14 12:06

@chekunkov do you by chance recall why this PR wasn't merged?

kmike, Nov 25 '16 17:11

@kmike nope, I have no idea why.

chekunkov, Nov 25 '16 19:11