webstruct

refactoring to work with the annotated plain text

Open tpeng opened this issue 10 years ago • 1 comment

Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box.

The annotated input text is similar to GATE's, e.g. this is a <NER>test</NER>. The entities are surrounded by <> tags. The rest of the change just moves the generic code to a more appropriate place.

tpeng avatar May 26 '14 13:05 tpeng
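
For readers unfamiliar with the annotation format described above, here is a minimal sketch of how such text could be turned into tokens and labels. It is not the code from this change: `parse_annotated_text`, the regex-based tokenizer, and the choice of IOB labels with "O" for non-entity tokens are all assumptions made for illustration.

```python
import re

# Hypothetical helpers for illustration; not part of webstruct.
# Matches <LABEL>some text</LABEL> where the closing tag repeats the label.
_ENTITY_RE = re.compile(r"<(?P<label>\w+)>(?P<text>.*?)</(?P=label)>", re.DOTALL)
# Naive tokenizer: runs of word characters, or single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def parse_annotated_text(text, outside_label="O"):
    """Return (tokens, labels) for text annotated with <LABEL>...</LABEL> tags.

    Labels use IOB encoding (an assumption): B-LABEL for the first token
    of an entity, I-LABEL for the following ones, `outside_label` elsewhere.
    """
    tokens, labels = [], []
    pos = 0
    for match in _ENTITY_RE.finditer(text):
        # Plain text between the previous entity and this one.
        for tok in _TOKEN_RE.findall(text[pos:match.start()]):
            tokens.append(tok)
            labels.append(outside_label)
        # Tokens inside the entity tags.
        for i, tok in enumerate(_TOKEN_RE.findall(match.group("text"))):
            prefix = "B-" if i == 0 else "I-"
            tokens.append(tok)
            labels.append(prefix + match.group("label"))
        pos = match.end()
    # Trailing plain text after the last entity.
    for tok in _TOKEN_RE.findall(text[pos:]):
        tokens.append(tok)
        labels.append(outside_label)
    return tokens, labels
```

Calling `parse_annotated_text("this is a <NER>test</NER>.")` yields two parallel lists, one of tokens and one of labels, which is the shape sequence taggers typically train on. Nested or malformed tags are not handled in this sketch.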

My main concern is the Token class and the TextTokenizer thing. Creating Token instances looks like total overkill: why would anyone need to wrap a text token in a Token instance and keep a reference to all the other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.

kmike avatar May 26 '14 15:05 kmike
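
One way to read the objection above: neighbouring tokens can be reached by passing the token list and an index to a feature function, with no wrapper object per token. The sketch below only illustrates that idea; `token_features`, `sequence_features`, and the feature names are hypothetical and do not come from webstruct or from this change.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from a plain list of strings.

    Context is looked up by index, so a token does not need to hold
    references to its neighbours.
    """
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        # Sentinel strings mark the sequence boundaries (an arbitrary choice).
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sequence_features(tokens):
    """Return one feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

For example, `sequence_features(["this", "is", "a", "test"])` returns four dicts, one per token. Whether this is preferable to a Token class depends on how much per-token state the feature functions need.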