TextRank
TextRank throws an error when using a "»" character
I am processing a block of text with TextRank and it is throwing an error. The text is in French. The language is detected correctly. The part of the text that seems to be throwing the error is:
... «les derniers jours de guerre» ...
TextRank returns the following, with the final raquo being encoded incorrectly:
accord historique,Colombie,jours,guerre�,derniers
It appears the invalid character gets introduced in the DefaultEvents::get_words method:
public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}
The text appears fine before the preg_split method is called and gets encoded incorrectly in the $words variable afterwards.
I've tried adding the raquos to the French stopwords and replacing preg_split with mb_split, but neither attempt worked, or each introduced other issues. It's worth noting that the opening raquo seems to be processed fine. It's the final raquo that seems to cause the issue.
@BenParizek I think I know where the problem is. Let me do some tests locally, and I will add some PHPUnit tests as well.
If you are in a rush, I would change preg_split to use the /u (UTF-8) modifier, or perhaps use mb_split instead:
public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}
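For anyone who wants to reproduce this outside the library, here is a minimal standalone sketch. The helpers split_words and all_valid_utf8 are hypothetical names made up for this example and are not part of TextRank. The sample sentence is the fragment from the report with one extra word appended so that whitespace follows the closing quote. The likely explanation, stated as an assumption: without /u, PCRE matches byte by byte, "»" is the two bytes 0xC2 0xBB in UTF-8, and only the second byte looks like punctuation to \p{P}, so the first byte is left dangling on the preceding word.

```php
<?php
// Hypothetical helper for illustration only: same pattern as get_words,
// with the /u modifier switched on or off.
function split_words(string $text, bool $utf8): array
{
    $pattern = '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/' . ($utf8 ? 'u' : '');
    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}

// Hypothetical helper: true when every token is well-formed UTF-8.
// preg_match with an empty /u pattern returns false on invalid UTF-8 input.
function all_valid_utf8(array $tokens): bool
{
    foreach ($tokens as $token) {
        if (preg_match('//u', $token) !== 1) {
            return false;
        }
    }
    return true;
}

// "\u{AB}" is the opening quote, "\u{BB}" the closing one.
$text = "\u{AB}les derniers jours de guerre\u{BB} ont";

var_dump(all_valid_utf8(split_words($text, false))); // byte-wise matching
var_dump(all_valid_utf8(split_words($text, true)));  // UTF-8 aware matching
```

If the assumption holds, the first call reports a broken token and the second does not. It would also explain why the opening quote survives: its second byte is only ever reached after a byte that is not punctuation, so the pair is never split.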
Any plans to push a new version?
@fawzib I think the change was pushed on the develop branch: https://github.com/crodas/TextRank/commit/073b9026050e8500257f8853bceab1aeb3827708
It would be nice to see this package released with a version number instead of just needing to require dev-master.