php-sentence
Encoding issue: not a blocker, but something for fellow users to be aware of.
The original text is transformed in the returned sentences, so finding the start/end of each sentence in the original means searching for a transformed sentence.
For example, for the text
Super Tortas is Sun Valley is outstanding! I haven’t had as many Tortas I would like.
I couldn't find the start of the second sentence, because it comes back as
I haven't had as many Tortas I would like.
which isn't found in the original (the curly apostrophe ’ has become a straight ').
Code (not working: $iStart comes back as false for the second sentence):
$oSentenceSplitter = new Sentence;
$aRawSentences = $oSentenceSplitter->split($sText,Sentence::SPLIT_TRIM);
$iOffset = 0;
foreach($aRawSentences as $sRawSentence) {
    // search for the returned (transformed) sentence in the untransformed text
    $iStart = mb_strpos($sText,$sRawSentence,$iOffset);
    $iLength = mb_strlen($sRawSentence);
    $iOffset += $iLength;
}
Corrected code:
$sCleanedText = Multibyte::cleanUnicode($sText);
$oSentenceSplitter = new Sentence;
$aRawSentences = $oSentenceSplitter->split($sCleanedText,Sentence::SPLIT_TRIM);
$iOffset = 0;
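foreach($aRawSentences as $sRawSentence) {
    // search in the cleaned text, where the returned sentence does occur
    $iStart = mb_strpos($sCleanedText,$sRawSentence,$iOffset);
    $iLength = mb_strlen($sRawSentence);
    // cut the same span out of the original to recover the untransformed sentence
    $sSentenceOut = mb_substr($sText,$iStart,$iLength);
    // continue after the end of this sentence
    $iOffset = $iStart + $iLength;
}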
Note that the above only works as long as the transformations preserve offsets (i.e. the mb_strlen of each transformed sentence is the same as the mb_strlen of that sentence in the original, and nothing before it changes length either).
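
To make that caveat concrete, here is a minimal sketch of the offset mapping with a guard in front of it. It does not call php-sentence at all: normalizeQuotes() is a hypothetical stand-in for the library's cleaning step (I'm assuming a one-character-for-one-character quote replacement), mapSentencesToOriginal() is a helper name I made up, and the sentence list is hard-coded where the splitter output would go. The total-length comparison is a cheap sanity check, not a proof that every offset lines up.

```php
<?php
// Hypothetical stand-in for the cleaning step: curly quotes become straight
// quotes, one character for one character, so character offsets are preserved.
function normalizeQuotes(string $text): string
{
    return str_replace(['‘', '’', '“', '”'], ["'", "'", '"', '"'], $text);
}

// Find each cleaned sentence in the cleaned text and cut the same span out of
// the original. Returns null when the offsets cannot be trusted.
function mapSentencesToOriginal(string $original, string $cleaned, array $sentences): ?array
{
    // Cheap guard: if cleaning changed the total length, offsets have shifted.
    if (mb_strlen($original, 'UTF-8') !== mb_strlen($cleaned, 'UTF-8')) {
        return null;
    }

    $spans = [];
    $offset = 0;
    foreach ($sentences as $sentence) {
        $start = mb_strpos($cleaned, $sentence, $offset, 'UTF-8');
        if ($start === false) {
            return null; // sentence not present in the cleaned text
        }
        $length = mb_strlen($sentence, 'UTF-8');
        $spans[] = [
            'start'  => $start,
            'length' => $length,
            'text'   => mb_substr($original, $start, $length, 'UTF-8'),
        ];
        $offset = $start + $length; // continue after this sentence
    }
    return $spans;
}

$text = 'Super Tortas is Sun Valley is outstanding! I haven’t had as many Tortas I would like.';
$cleaned = normalizeQuotes($text);

// Hard-coded here; in real use this would be the splitter's output.
$sentences = [
    'Super Tortas is Sun Valley is outstanding!',
    "I haven't had as many Tortas I would like.",
];

$spans = mapSentencesToOriginal($text, $cleaned, $sentences);
```

If mapSentencesToOriginal() returns null, the cleaned and original texts have drifted apart and a smarter alignment would be needed; otherwise each 'text' entry is the sentence as it appears in the original, curly apostrophe included.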