
Implosion/to_s problem with Enclitics

Open n8 opened this issue 11 years ago • 2 comments

    text = "It's about time."
    text = sentence(text).apply(:tokenize, :parse)
    puts text.to_s

Results in:

It 's about time.

Shouldn't to_s return the text without the extra space between It and 's?

n8 • Jan 13 '14 22:01

"It's" is a contraction. For tokenisation, contractions are often treated as two words (because they really are). This is the case in Stanford CoreNLP: http://stackoverflow.com/questions/14058399/stanford-corenlp-split-words-ignoring-apostrophe

One option, as suggested in the above link, would be to rejoin enclitics in the implode method. In treat this would be in the module Treat::Entities::Entity::Stringable.
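As a rough illustration of that idea, here is a minimal sketch that joins a flat list of tokens and glues enclitics and punctuation onto the previous token. The helper name `implode_tokens` and both constant lists are hypothetical and not part of treat; the real implode method works on an entity tree, not an array.

```ruby
# Hypothetical lists for illustration only; treat has its own entity classes.
ENCLITICS   = %w['s 're 've 'll 'd 'm n't].freeze
PUNCTUATION = %w[. , ; : ! ?].freeze

# Hypothetical helper (not treat's API): joins tokens with single spaces,
# except that enclitics and punctuation attach directly to the previous token.
def implode_tokens(tokens)
  tokens.each_with_object(+'') do |token, out|
    attach = ENCLITICS.include?(token.downcase) || PUNCTUATION.include?(token)
    out << ' ' unless out.empty? || attach
    out << token
  end
end

puts implode_tokens(%w[It 's about time .])
```

The space is decided before each token is appended, so there is never a trailing space to strip afterwards.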

chrisanderton • Jan 22 '14 23:01

So, it looks like the issue is with the current implode method on Stringable. Although it attempts to handle enclitics, from what I can see in the current implementation 'value' would already be blank at that point, so calling strip! makes no difference. When the imploded parts are merged, the space is still there (it is outside the scope of the strip!).

Here's a fixed version. I modified the recursive call to pass the value string along, so all operations are performed on that one string instead of on multiple copies. A disclaimer: I only started looking at treat about 3 hours ago!

https://github.com/chris-at-thewebfellas/treat/commit/d9b912f24d7673863ca3ea7e59016f022923ac66

For the same code, this now gives:

It's about time.
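The difference between the two approaches can be sketched roughly as follows. This is a simplified model and not the code from the commit: `Node`, `ATTACH_LEFT`, `sample_tree`, `implode_with_copies` and `implode_into` are all hypothetical names, and the tree shape (the enclitic sitting at the start of its own phrase) is an assumption about what a parsed sentence might look like.

```ruby
# Hypothetical stand-in for treat's entity tree: leaves carry a token,
# inner nodes carry an empty value and a list of children.
Node = Struct.new(:value, :children) do
  def leaf?
    children.empty?
  end
end

# Tokens that should attach to whatever precedes them (illustrative list).
ATTACH_LEFT = %w['s 're 've 'll 'd 'm n't . , ; : ! ?].freeze

# Assumed parse of "It's about time.": the enclitic opens the second phrase.
def sample_tree
  leaf = ->(token) { Node.new(token, []) }
  Node.new('', [
    Node.new('', [leaf.call('It')]),
    Node.new('', %w['s about time .].map { |t| leaf.call(t) })
  ])
end

# Copy-based shape: every call builds its own local string, so strip! can
# only remove a space that this same call added.
def implode_with_copies(node)
  return node.value.dup if node.leaf?
  value = +''
  node.children.each do |child|
    value.strip! if ATTACH_LEFT.include?(child.value) # blank for a phrase-initial enclitic
    value << implode_with_copies(child)
    value << ' ' if child.leaf?
  end
  value
end

# Shared-buffer shape: one string is passed down the recursion, so the
# trailing space left by an earlier phrase is still reachable.
def implode_into(node, buffer = +'')
  if node.leaf?
    buffer.rstrip! if ATTACH_LEFT.include?(node.value)
    buffer << node.value << ' '
  else
    node.children.each { |child| implode_into(child, buffer) }
  end
  buffer
end

puts implode_with_copies(sample_tree).strip
puts implode_into(sample_tree).strip
```

In the copy-based version the space after "It" belongs to the first phrase's string, which the second phrase never sees; passing a single buffer removes that boundary.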

chrisanderton • Jan 23 '14 01:01