tokenize phrasal-verbs

Open ayman-ibrahim opened this issue 6 years ago • 7 comments

Is there a way to tokenize a sentence while taking phrasal verbs into account? Example:

"The flight take off at three o'clock"

The output should be: [the, flight, take off, at, three, o'clock]

"take off" should be tokenized as a single token.

ayman-ibrahim avatar Nov 14 '18 21:11 ayman-ibrahim

IMHO that is not what tokenization is meant for. Tokenization splits a text into words (and punctuation, if necessary), and "take off" consists of two words. Combining them into a phrasal verb requires partial parsing or chunking.
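As a rough illustration of treating this as a step after tokenization, here is a minimal sketch that merges adjacent tokens found in a small lookup list. It does not use natural's API: `PHRASAL_VERBS`, `tokenize`, and `mergePhrasalVerbs` are hypothetical names made up for this example, and the whitespace tokenizer is a stand-in for whatever tokenizer you already use.

```javascript
// Hypothetical lexicon of phrasal verbs (an assumption for this sketch;
// a real list would be much larger).
const PHRASAL_VERBS = new Set(["take off", "give up", "look after"]);

// Stand-in tokenizer: splits on whitespace only.
function tokenize(sentence) {
  return sentence.trim().split(/\s+/);
}

// Walk the tokens and merge any adjacent pair that appears in the lexicon.
function mergePhrasalVerbs(tokens, lexicon = PHRASAL_VERBS) {
  const merged = [];
  let i = 0;
  while (i < tokens.length) {
    const hasNext = i + 1 < tokens.length;
    const pair = hasNext ? `${tokens[i]} ${tokens[i + 1]}` : null;
    if (pair !== null && lexicon.has(pair.toLowerCase())) {
      merged.push(pair); // keep the pair as one token
      i += 2;
    } else {
      merged.push(tokens[i]);
      i += 1;
    }
  }
  return merged;
}

console.log(mergePhrasalVerbs(tokenize("The flight take off at three o'clock")));
// [ 'The', 'flight', 'take off', 'at', 'three', "o'clock" ]
```

A plain lookup like this misses inflected forms ("took off") and separated forms ("take it off"), which is where parsing or chunking becomes necessary.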

Hugo-ter-Doest avatar Nov 14 '18 22:11 Hugo-ter-Doest

@Hugo-ter-Doest OK, do you know if there's a way to combine phrasal verbs in the natural library?

ayman-ibrahim avatar Nov 14 '18 22:11 ayman-ibrahim

It's not in natural yet, but I'm working on it so that it can be used for named entity recognition. You can preview the CYK and Earley parsers in this branch: https://github.com/Hugo-ter-Doest/natural/tree/NER/

The parsers are in lib/natural/parsers; a chunker based on the Earley parser is in lib/natural/NER.

Feel free to use it already, but it may still change.

Hugo-ter-Doest avatar Nov 14 '18 22:11 Hugo-ter-Doest

Cool, I'll have a look. Thanks.

ayman-ibrahim avatar Nov 14 '18 22:11 ayman-ibrahim

You could tokenize your sentence, tag each token's part of speech, and then find patterns. For example, VERB + DET or VERB + PREPOSITION. I use that to find noun phrases (JJ|NN+).
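A minimal sketch of that idea, assuming the tokens have already been tagged: the tags below are hand-written Penn Treebank style tags, not the output of any particular tagger, and `mergeVerbParticle` and the `PHRASAL_VERB` label are hypothetical names for this example.

```javascript
// Hand-written tags for illustration; in practice a POS tagger would supply them.
const taggedSentence = [
  { token: "The", tag: "DT" },
  { token: "flight", tag: "NN" },
  { token: "take", tag: "VB" },
  { token: "off", tag: "RP" },
  { token: "at", tag: "IN" },
  { token: "three", tag: "CD" },
  { token: "o'clock", tag: "NN" },
];

// Tags that may follow a verb to form a candidate: particle (RP) or preposition (IN).
const FOLLOWER_TAGS = new Set(["RP", "IN"]);

// Merge every VERB + particle/preposition pair into a single token.
function mergeVerbParticle(taggedTokens) {
  const result = [];
  for (let i = 0; i < taggedTokens.length; i++) {
    const current = taggedTokens[i];
    const next = taggedTokens[i + 1];
    if (next && current.tag.startsWith("VB") && FOLLOWER_TAGS.has(next.tag)) {
      result.push({ token: `${current.token} ${next.token}`, tag: "PHRASAL_VERB" });
      i++; // skip the token that was just merged
    } else {
      result.push(current);
    }
  }
  return result;
}

console.log(mergeVerbParticle(taggedSentence).map((t) => t.token));
// [ 'The', 'flight', 'take off', 'at', 'three', "o'clock" ]
```

Matching on tags alone overgenerates (it would also merge "arrive at"), so the candidates probably still need filtering against a list of known phrasal verbs.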

lazharichir avatar Nov 20 '18 22:11 lazharichir

@Hugo-ter-Doest Do you have a set timeline as to when you would be able to integrate the code into Natural's codebase?

privateOmega avatar Jan 07 '19 11:01 privateOmega

For now, you can implement that with some form of pattern matching (similar to spaCy's matcher): walk the array of tokens and look for whatever patterns you need (e.g. a NOUN followed by a PREP, or any number of NOUNs/ADJs followed by a PREP, etc.).

You can look at spaCy's code (Python) and port it to Node and Natural's token structure: https://github.com/explosion/spaCy/tree/master/spacy/matcher
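To give an idea of what walking the token array could look like, here is a small sketch of a tag-pattern matcher. It is not a port of spaCy's matcher and does not reproduce its API: the pattern format (`tags` plus an optional `op: "+"`), `matchAt`, and `findMatches` are assumptions made up for this example, the matching is greedy with no backtracking, and the tags are hand-written.

```javascript
// Try to match the whole pattern starting at index `start`.
// Returns the end index (exclusive) on success, or -1 on failure.
function matchAt(tagged, start, pattern) {
  let pos = start;
  for (const step of pattern) {
    let count = 0;
    while (pos < tagged.length && step.tags.includes(tagged[pos].tag)) {
      pos++;
      count++;
      if (step.op !== "+") break; // without "+", consume exactly one token
    }
    if (count === 0) return -1; // this step needs at least one token
  }
  return pos;
}

// Scan left to right and collect non-overlapping matches as strings.
function findMatches(tagged, pattern) {
  const matches = [];
  let i = 0;
  while (i < tagged.length) {
    const end = matchAt(tagged, i, pattern);
    if (end === -1) {
      i++;
    } else {
      matches.push(tagged.slice(i, end).map((t) => t.token).join(" "));
      i = end;
    }
  }
  return matches;
}

// Hand-written tags for illustration only.
const example = [
  { token: "The", tag: "DT" }, { token: "early", tag: "JJ" },
  { token: "morning", tag: "NN" }, { token: "flight", tag: "NN" },
  { token: "from", tag: "IN" }, { token: "Cairo", tag: "NNP" },
  { token: "leaves", tag: "VBZ" }, { token: "at", tag: "IN" },
  { token: "noon", tag: "NN" },
];

// One or more adjectives/nouns followed by a preposition.
const nounsThenPrep = [{ tags: ["JJ", "NN", "NNS"], op: "+" }, { tags: ["IN"] }];

console.log(findMatches(example, nounsThenPrep));
// [ 'early morning flight from' ]
```

The same matcher covers the phrasal-verb case with a pattern such as `[{ tags: ["VB", "VBZ", "VBD"] }, { tags: ["RP"] }]`.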

lazharichir avatar Apr 11 '19 09:04 lazharichir