Apostrophes at start or end of word seem to mess up the segmenter
Given this text:
When it first arrived, I thought it was huge, and was thinkin' 'bout returning it, even though it is the size they say it is, It just seemed really large in person. I kept it and started using it. It is very easy to use with the instruction manual in hand, and I don't need that anymore for the things I do. I've scanned, copied, enlarged and printed double sided. All verry intuitive now. Prints clean and clear, bought a two pack of extra capacity black ink cartridges from Epson, delivered they were only $37, which I thought was reasonable, and it doesn't even look that big anymore. I am likin' it more all the time, and real happy with my choice.
If I convert the last "likin'" to "likin" then it segments into 2 phrases. If I convert the first "thinkin'" to "thinkin" it segments to 1 phrase. If I convert the first "'bout" to "bout" then it segments to 7 phrases.
Ooh, this is a tricky one 🙈 The segmenter assumes the apostrophes form a quoted section, and the library doesn't do sentence boundary detection inside quoted sections. I've been mulling over rules to detect this situation, but it's a hard one.
I am honestly not sure either.
A ' at the end of a word, as in thinkin', could be treated as a normal character. A ' in the middle of a word, as in I've, could be treated as a normal character too.
However, it's probably not realistic to tell a quoted section apart from excessive ' usage. An option might help: AllowQuotes or something.
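To make the positional distinction above concrete, here is a minimal sketch that labels each ' by the characters around it. It is not the library's code; the function name `classify_apostrophes` and the label strings are made up for illustration.

```python
def classify_apostrophes(text):
    """Label every ' in text by its position relative to letters.

    Hypothetical helper, not part of any segmenter library.
    Returns a list of (index, kind) tuples.
    """
    results = []
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        # Treat the text edges as whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if before.isalpha() and after.isalpha():
            kind = "mid-word"    # I've, don't
        elif before.isalpha():
            kind = "word-end"    # thinkin' (or a closing quote)
        elif after.isalpha():
            kind = "word-start"  # 'bout (or an opening quote)
        else:
            kind = "isolated"
        results.append((i, kind))
    return results


kinds = [kind for _, kind in classify_apostrophes("thinkin' 'bout it, I don't")]
print(kinds)
```

Only the "mid-word" case is unambiguous. The "word-end" and "word-start" labels cannot tell a dropped letter from a real quote mark, which is the hard part discussed in this thread.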
The trouble is that a ' at the end of almost any word will be a closing single quote, which 'might signify the presence of a quoted pair', but without hardcoding every possible quoted word ending it's highly likely to give false positives.
Currently the library doesn't do quote pair detection; it just has a set of regexes for probable quoted pairs and naively ignores sentence breaks between pairs, I think. By distinguishing between open and close quotes we'd at least detect the first thinkin' as being unrelated. For reasons I need to drill into, the library doesn't treat 'bout as an opening single quote, as far as I can tell. So quote pair detection might work here; I'm just worried about what it might break, since text like this is very unusual.
An alternative that just occurred to me: if the initial character following the opening quote is lowercase, as in 'bout, and the quoted pair crosses several sentence boundaries (checked by running the segmenter against the inner text), then the pair is invalid. I'll have more of a think about it.
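A rough sketch of that check, under loud assumptions: the boundary counter below is a naive regex stand-in for running the real segmenter on the inner text, and the names `count_boundaries`, `is_plausible_quote_pair` and the `max_boundaries` threshold are all hypothetical.

```python
import re

# Naive stand-in for real sentence boundary detection:
# terminal punctuation, whitespace, then an uppercase letter.
BOUNDARY = re.compile(r"[.!?]\s+(?=[A-Z])")


def count_boundaries(inner):
    """Count sentence boundaries inside a candidate quoted span."""
    return len(BOUNDARY.findall(inner))


def is_plausible_quote_pair(text, open_idx, close_idx, max_boundaries=1):
    """Reject a pair whose inner text starts lowercase AND crosses
    more than max_boundaries sentence boundaries (assumed threshold)."""
    inner = text[open_idx + 1:close_idx]
    if not inner:
        return False
    starts_lower = inner[0].islower()
    crosses_many = count_boundaries(inner) > max_boundaries
    return not (starts_lower and crosses_many)


sample = "I was thinkin 'bout it. It is big. I kept it. I am likin' it."
print(is_plausible_quote_pair(sample, sample.index("'"), sample.rindex("'")))
```

A genuine quote such as 'Go home. Now.' starts with an uppercase letter and so still passes, while the 'bout ... likin' span is rejected because it starts lowercase and spans several sentences.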
If the first quote seen is not preceded by a whitespace character, we could assume it is not the start of a quote. For the end of a quote, we could use the lowercase solution you mentioned.
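The whitespace rule could look something like the sketch below. It is only an illustration, not how the library pairs quotes today; `find_quote_pairs` is a hypothetical name, and the closing-quote condition (not followed by a letter or digit) is an extra assumption added so that I've and don't are not treated as closers.

```python
def find_quote_pairs(text):
    """Pair single quotes using the whitespace rule (hypothetical sketch).

    An opener must be preceded by whitespace (or the start of the text).
    A closer must follow a non-space character and must not be followed
    by a letter or digit, so mid-word apostrophes are skipped.
    """
    pairs = []
    open_idx = None
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if open_idx is None:
            # thinkin' fails here: it is preceded by a letter.
            if before.isspace() and not after.isspace():
                open_idx = i
        elif not before.isspace() and not after.isalnum():
            pairs.append((open_idx, i))
            open_idx = None
    return pairs


sample = "was thinkin' 'bout it and I don't mind. I am likin' it"
print(find_quote_pairs(sample))
```

On this sample the rule correctly ignores thinkin' but still pairs 'bout with likin', so on its own it is not enough; the resulting pair would still need to be filtered by something like the lowercase check.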