LiteralNER shall tokenize each entry

Open jmansilla opened this issue 10 years ago • 1 comments

Right now our LiteralNER is very literal, so in some cases it does not work.

Example: an entry like this

takayasu's arteritis

is never found because the documents are tokenized, which transforms this

John had takayasu's arteritis

into this

John had takayasu 's arteritis

making a match impossible (notice that 's is a separate token).

Also, what makes things harder is that the tokenizer used to parse the LiteralNER entries must be the same tokenizer used to tokenize the text.
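
To illustrate the idea, here is a minimal sketch of matching entries by token sequence rather than by raw string. It is not iepy's code: `tokenize` is a stand-in regex tokenizer and `find_literal_matches` is a hypothetical helper, both assumed only for this example. The point is that the entry and the text go through the same tokenizer before being compared.

```python
import re

# Stand-in tokenizer (an assumption, not iepy's real tokenizer): it splits
# the clitic 's from the preceding word, as in the example above.
_TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text."""
    return _TOKEN_RE.findall(text)


def find_literal_matches(entries, text, tokenizer=tokenize):
    """Hypothetical helper: find each entry in text as a token sequence.

    Entries and text are tokenized with the same tokenizer, so an entry
    such as "takayasu's arteritis" is compared as three tokens.
    Returns (start, end, entry) tuples, where start/end are token offsets.
    """
    tokens = tokenizer(text)
    matches = []
    for entry in entries:
        entry_tokens = tokenizer(entry)
        n = len(entry_tokens)
        if n == 0:
            continue  # skip entries that produce no tokens
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == entry_tokens:
                matches.append((start, start + n, entry))
    return matches


matches = find_literal_matches(
    ["takayasu's arteritis"], "John had takayasu's arteritis"
)
```

With this stand-in tokenizer the entry becomes the tokens `takayasu`, `'s`, `arteritis`, which line up with the tokenized document.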

jmansilla, May 16 '14 17:05

For documentation's sake, something we talked about in person:

Perhaps tokenization, pos-tagging, ner-tagging and segmentation could be non-optional parts of the preprocessing pipeline since iepy's core would break without them anyway.

If those parts are non-optional, then they can be passed as kwargs to the preprocessing pipeline and therefore it can be easily ensured that tokenization is the same for both documents and literal tagging.
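
A rough sketch of what that could look like, under loud assumptions: `Pipeline`, `LiteralTagger` and `simple_tokenize` are hypothetical names made up for this comment, not iepy's actual API, and only the tokenizer and the literal tagger are shown (pos-tagging and segmentation are left out).

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): splits 's into its own token."""
    return re.findall(r"'s\b|\w+|[^\w\s]", text)


class LiteralTagger:
    """Hypothetical literal NER step that receives the tokenizer explicitly."""

    def __init__(self, entries, tokenizer):
        # Entries are tokenized once, with the tokenizer the pipeline uses.
        tokenized = (tuple(tokenizer(entry)) for entry in entries)
        self.entries = [tokens for tokens in tokenized if tokens]

    def __call__(self, tokens):
        """Return (start, end) token offsets of every entry occurrence."""
        found = []
        for entry in self.entries:
            n = len(entry)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == entry:
                    found.append((i, i + n))
        return found


class Pipeline:
    """Hypothetical pipeline where the tokenizer is a required kwarg."""

    def __init__(self, *, tokenizer, literal_entries):
        self.tokenizer = tokenizer
        # The same tokenizer object is handed to the literal tagger, so
        # documents and entries cannot be tokenized differently.
        self.ner = LiteralTagger(literal_entries, tokenizer)

    def process(self, text):
        tokens = self.tokenizer(text)
        return tokens, self.ner(tokens)


pipeline = Pipeline(
    tokenizer=simple_tokenize,
    literal_entries=["takayasu's arteritis"],
)
tokens, spans = pipeline.process("John had takayasu's arteritis")
```

Because the tokenizer is keyword-only and has no default, building the pipeline without one fails immediately instead of silently using a mismatched tokenizer.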

rafacarrascosa, May 16 '14 20:05