Apostrophes at start or end of word seem to mess up the segmenter

Open Telavian opened this issue 4 years ago • 5 comments

Given this text:

When it first arrived, I thought it was huge, and was thinkin' 'bout returning it, even though it is the size they say it is, It just seemed really large in person. I kept it and started using it. It is very easy to use with the instruction manual in hand, and I don't need that anymore for the things I do. I've scanned, copied, enlarged and printed double sided. All verry intuitive now. Prints clean and clear, bought a two pack of extra capacity black ink cartridges from Epson, delivered they were only $37, which I thought was reasonable, and it doesn't even look that big anymore. I am likin' it more all the time, and real happy with my choice.

If I convert the last "likin'" to "likin", it segments into 2 phrases. If I convert the first "thinkin'" to "thinkin", it segments into 1 phrase. If I convert the first "'bout" to "bout", it segments into 7 phrases.

Telavian, May 23 '21 01:05

Ooh, this is a tricky one 🙈 The library assumes the apostrophes form a quoted section, and it doesn't do sentence boundary detection inside quoted sections. I've been mulling over rules to detect this situation, but it's a hard one.

EliotJones, Jun 16 '21 15:06

I am honestly not sure either.

A ' at the end of a word, like thinkin', could be treated as a normal character. A ' in the middle of a word, like I've, could be treated as a normal character too.

However, it's probably not realistic to tell the difference between a quoted section and excessive ' usage. An option might help: AllowQuotes or something.

Telavian, Jun 16 '21 19:06
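
The "treat it as a normal character" idea can be tried outside the library by masking apostrophes before segmenting and restoring them afterwards. Below is a minimal Python sketch of that workaround, not anything PragmaticSegmenterNet provides: `protect_apostrophes`, `restore_apostrophes`, and the placeholder character are hypothetical names chosen for illustration, and the rule (any apostrophe directly preceded by a letter) is an assumption.

```python
import re

# Hypothetical placeholder: a private-use code point unlikely to occur in real text.
PLACEHOLDER = "\ue000"

# An apostrophe directly preceded by a letter. This covers mid-word (I've)
# and word-final (thinkin') cases, but not word-initial ones ('bout).
_WORD_APOSTROPHE = re.compile(r"(?<=[A-Za-z])'")


def protect_apostrophes(text: str) -> str:
    """Replace mid-word and word-final apostrophes with the placeholder."""
    return _WORD_APOSTROPHE.sub(PLACEHOLDER, text)


def restore_apostrophes(text: str) -> str:
    """Undo protect_apostrophes on a segment returned by the segmenter."""
    return text.replace(PLACEHOLDER, "'")


sample = "I was thinkin' 'bout it. I've kept it."
masked = protect_apostrophes(sample)
# The masked text would be passed to the segmenter, and each returned
# segment run through restore_apostrophes.
```

Note that this rule also masks a genuine closing quote that follows a letter, so it would only make sense as an opt-in behaviour, in the spirit of the AllowQuotes-style option suggested above.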

The trouble is that a ' at the end of almost any word will be a closing single quote, which 'might signify the presence of a quoted pair', but without hardcoding every possible quoted word end it's highly likely to give false positives.

Currently the library doesn't do quote pair detection; it just has a set of regexes for probable quoted pairs and naively ignores sentence breaks between pairs, I think. By distinguishing between opening and closing quotes we'd at least detect the first thinkin' as being unrelated. For reasons I still need to drill into, the library doesn't treat 'bout as an opening single quote, as far as I can tell. So quote pair detection might work here; I'm just worried about what it might break, since text like this is very unusual.

EliotJones, Jun 16 '21 21:06

An alternative that just occurred to me: if the initial character following the quote is lowercase, as in 'bout, and the quoted pair crosses several sentence boundaries (checked by running the segmenter against the inner text), then the pair is invalid. I'll have more of a think about it.

EliotJones, Jun 16 '21 21:06
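
The lowercase-plus-boundaries check could look roughly like the Python sketch below. It is an illustration only, under stated assumptions: `is_plausible_quote` and `naive_segment` are hypothetical names, the regex splitter is a crude stand-in for the real segmenter, and reading "several" as "more than `max_sentences` sentences" (default 2) is a guess.

```python
import re
from typing import Callable, List


def naive_segment(text: str) -> List[str]:
    """Stand-in segmenter for the demo: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def is_plausible_quote(
    inner: str,
    segment: Callable[[str], List[str]] = naive_segment,
    max_sentences: int = 2,  # assumed threshold for "several"
) -> bool:
    """Return False when a candidate quoted span looks like a stray apostrophe pair.

    The pair is rejected only when both conditions hold: the inner text starts
    with a lowercase letter, and segmenting the inner text yields more than
    max_sentences sentences.
    """
    starts_lowercase = inner[:1].islower()
    crosses_many = len(segment(inner)) > max_sentences
    return not (starts_lowercase and crosses_many)


# Inner text of a span running from 'bout to likin' (shortened for the demo).
stray = "bout returning it. I kept it. I've scanned it. I am likin"
genuine = "Keep it. It works."
```

Here `is_plausible_quote(stray)` is rejected because it starts lowercase and splits into four sentences, while `genuine` and a short lowercase span such as `"bout"` are both kept. In a real fix, the `segment` argument would be the library's own segmenter rather than the regex stand-in.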

If the first quote seen does not have a preceding whitespace character, then we could assume it is not the start of a quote. For the end of a quote, we could use the lowercase solution you mentioned.

Telavian, Jun 17 '21 03:06
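
To make the whitespace rule concrete, here is a small Python sketch of quote pair detection built on it. None of this is the library's code: `find_quote_pairs` is a hypothetical helper, and both rules are assumptions (an opening quote follows whitespace or the start of the text; a closing quote follows a non-space character and is not followed by a word character).

```python
import re
from typing import List, Tuple

# Assumed rules, not the library's regexes.
_OPEN = re.compile(r"(?<!\S)'(?=\S)")   # after whitespace or start of text
_CLOSE = re.compile(r"(?<=\S)'(?!\w)")  # after a non-space, not before a word char


def find_quote_pairs(text: str) -> List[Tuple[int, int]]:
    """Pair each opening quote with the next closing quote after it.

    Returns (open_index, close_index) tuples. Nested quotes are not handled:
    a second opener seen while one is pending is ignored.
    """
    events = sorted(
        [(m.start(), "open") for m in _OPEN.finditer(text)]
        + [(m.start(), "close") for m in _CLOSE.finditer(text)]
    )
    pairs: List[Tuple[int, int]] = []
    open_at = None
    for pos, kind in events:
        if kind == "open":
            if open_at is None:
                open_at = pos
        elif open_at is not None:
            pairs.append((open_at, pos))
            open_at = None
        # A closer with no pending opener (thinkin') is treated as a plain apostrophe.
    return pairs


text = "I was thinkin' about it. She said 'keep it' and left."
quoted = [text[a:b + 1] for a, b in find_quote_pairs(text)]
```

With these rules the apostrophe in thinkin' is skipped because it has no preceding whitespace, and mid-word apostrophes such as the one in I've match neither pattern. A word-initial apostrophe like 'bout still looks like an opener, so it would pair with the next closer unless the lowercase check is applied on top.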