encoding issue: not a blocker but something to be aware of for fellow users

Open victusfate opened this issue 2 years ago • 0 comments

The original text is transformed in the returned sentences, so finding the start and end of each sentence means searching for a transformed sentence that may not appear verbatim in the original.

For example, given the text

Super Tortas is Sun Valley is outstanding! I haven’t had as many Tortas I would like.

I could not find the start of the second sentence, because it comes back as

I haven't had as many Tortas I would like.

which isn't found in the original: the curly apostrophe (’) has been replaced by a straight one (').

Code (not working; $iStart comes back as false for the second sentence):

    $oSentenceSplitter = new Sentence;
    $aRawSentences = $oSentenceSplitter->split($sText, Sentence::SPLIT_TRIM);
    $iOffset = 0;
    foreach ($aRawSentences as $sRawSentence) {
        // false whenever the returned sentence differs from the original text
        $iStart = mb_strpos($sText, $sRawSentence, $iOffset);
        $iLength = mb_strlen($sRawSentence);
        $iOffset += $iLength;
    }

Corrected code:

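    // Search in a cleaned copy of the text, so it matches the returned sentences
    $sCleanedText = Multibyte::cleanUnicode($sText);
    $oSentenceSplitter = new Sentence;
    $aRawSentences = $oSentenceSplitter->split($sCleanedText, Sentence::SPLIT_TRIM);
    $iOffset = 0;
    foreach ($aRawSentences as $sRawSentence) {
        $iStart = mb_strpos($sCleanedText, $sRawSentence, $iOffset);
        $iLength = mb_strlen($sRawSentence);
        // Slice the same character range out of the original text
        $sSentenceOut = mb_substr($sText, $iStart, $iLength);
        // Resume the search right after this sentence
        $iOffset = $iStart + $iLength;
    }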

Note that the above only works as long as the transformation preserves character offsets, i.e. the mb_strlen of each transformed sentence is the same as the mb_strlen of that sentence in the original, and the same holds for all the text before it.
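
To make that caveat explicit, here is a minimal sketch that refuses to map offsets back when the cleaned text and the original differ in length. It does not call php-sentence at all: locateSentences() is a hypothetical helper, the str_replace() call is only a stand-in for whatever cleaning the library applies, and the sentence list is hard-coded to the output reported above. The length comparison is a coarse check; it would not catch a transformation that removes one character and adds another elsewhere.

```php
<?php
// Hypothetical helper (not part of php-sentence): find each sentence in the
// cleaned text and slice the same character range out of the original text.
function locateSentences(string $sOriginal, string $sCleaned, array $aSentences): array
{
    // Character offsets only carry over if cleaning preserved the overall length.
    if (mb_strlen($sOriginal, 'UTF-8') !== mb_strlen($sCleaned, 'UTF-8')) {
        throw new LengthException('Cleaned text differs in length; offsets cannot be mapped back.');
    }

    $aSpans = [];
    $iOffset = 0;
    foreach ($aSentences as $sSentence) {
        $iStart = mb_strpos($sCleaned, $sSentence, $iOffset, 'UTF-8');
        if ($iStart === false) {
            continue; // sentence not found in the cleaned text, skip it
        }
        $iLength = mb_strlen($sSentence, 'UTF-8');
        $aSpans[] = [
            'start'  => $iStart,
            'length' => $iLength,
            'text'   => mb_substr($sOriginal, $iStart, $iLength, 'UTF-8'),
        ];
        $iOffset = $iStart + $iLength; // resume right after this sentence
    }
    return $aSpans;
}

$sText = "Super Tortas is Sun Valley is outstanding! I haven’t had as many Tortas I would like.";

// Assumption: stand-in for the library's cleaning step (curly to straight apostrophe).
$sCleaned = str_replace("’", "'", $sText);

// Assumption: hard-coded to the sentences reported in this issue.
$aSentences = [
    "Super Tortas is Sun Valley is outstanding!",
    "I haven't had as many Tortas I would like.",
];

$aSpans = locateSentences($sText, $sCleaned, $aSentences);
```

Each entry in $aSpans carries the character offset, the length, and the sentence as it appears in the original text, so the second entry keeps its curly apostrophe.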

victusfate, Jan 19 '23 16:01