webstruct
Refactoring to work with annotated plain text
Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box.
The annotated input text is similar to GATE, e.g. `this is a <NER>test</NER>`: the entities are surrounded by <> tags. The rest of the change just moves the generic code to a more proper place.
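To make the input format concrete, here is a minimal sketch of how such annotated text could be turned into (token, label) pairs. It is only an illustration, not the code in this PR and not webstruct's API: `parse_annotated_text`, the whitespace tokenization, and the `"O"` label for unannotated tokens are all assumptions made for the example.

```python
import re

# Hypothetical helper, not part of webstruct: matches <TAG>text</TAG> spans.
_ENTITY_RE = re.compile(r"<(?P<tag>\w+)>(?P<text>.*?)</(?P=tag)>", re.DOTALL)


def parse_annotated_text(text, outside_label="O"):
    """Split GATE-like annotated text into (token, label) pairs.

    Tokens are split on whitespace (an assumption for this sketch);
    tokens inside <TAG>...</TAG> get TAG as their label, everything
    else gets ``outside_label``.
    """
    pairs = []
    pos = 0
    for match in _ENTITY_RE.finditer(text):
        # Text before the entity is unlabelled.
        before = text[pos:match.start()]
        pairs.extend((tok, outside_label) for tok in before.split())
        # Tokens inside the tags take the tag name as their label.
        inside = match.group("text")
        pairs.extend((tok, match.group("tag")) for tok in inside.split())
        pos = match.end()
    # Trailing text after the last entity is unlabelled too.
    pairs.extend((tok, outside_label) for tok in text[pos:].split())
    return pairs
```

For the example above, `parse_annotated_text("this is a <NER>test</NER>")` yields plain tuples with `test` labelled `NER` and the other tokens labelled `O`. The sketch does not handle nested tags or add B-/I- prefixes.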
My main concern is the Token class and the TextTokenizer thing. Creating Token instances looks like total overkill: why would anyone need to wrap a text token in a Token instance and keep a reference to all the other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.