TextRank throws an error when using a "»" character

Open BenParizek opened this issue 7 years ago • 3 comments

I am processing a block of text with TextRank and it is throwing an error. The text is in French. The language is detected correctly. The part of the text that seems to be throwing the error is:

... «les derniers jours de guerre» ...

TextRank returns the following, with the final raquo being encoded incorrectly:

accord historique,Colombie,jours,guerre�,derniers

It appears the invalid character gets introduced in the DefaultEvents::get_words method:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}

The text appears fine before the preg_split method is called and gets encoded incorrectly in the $words variable afterwards.

I've tried adding the guillemets to the French stopwords and switching preg_split to mb_split; both attempts either did not work or had other issues. It's worth noting that the opening guillemet («) seems to be processed fine. It's the closing one (») that seems to cause the issue.
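One plausible explanation for the asymmetry, offered as a sketch rather than a confirmed diagnosis: without the u modifier, PCRE works on single bytes. In UTF-8, "»" is the byte pair C2 BB and "«" is C2 AB. Taken alone, the second byte of each pair falls in the punctuation class, while the leading C2 does not. A closing guillemet followed by whitespace therefore loses only its second byte, whereas an opening guillemet at the start of a word begins with the non-punctuation byte and stays attached. The snippet below dumps the raw bytes of each token so this can be checked locally. The sample string comes from the report; the helper name is_valid_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: an empty pattern with /u matches only valid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Sample fragment from the report.
$text = '«les derniers jours de guerre» ...';

// Original pattern from get_words, without the u modifier.
$pattern = '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/';
$words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

foreach ($words as $word) {
    // bin2hex exposes the raw bytes so a dangling c2 is easy to spot.
    printf("%-24s valid UTF-8: %s\n", bin2hex($word), is_valid_utf8($word) ? 'yes' : 'no');
}
```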

BenParizek, Jan 23 '17 23:01

@BenParizek I think I know where the problem is. Let me run some tests locally, and I will add some PHPUnit tests as well.

If you are in a rush, I would change preg_split to use the u (UTF-8) modifier, or perhaps use mb_split instead:

public function get_words($text)
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}
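A minimal regression check for the patched pattern could look like the following. It is a standalone sketch that throws plain exceptions instead of using PHPUnit, and the function name get_words_fixed and the sample sentence are illustrative, not taken from the library's test suite.

```php
<?php
// Standalone copy of the patched method (hypothetical name, for illustration only).
function get_words_fixed(string $text): array
{
    $words = preg_split('/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return array_values(array_filter(array_map('trim', $words)));
}

$words = get_words_fixed('«les derniers jours de guerre»');

foreach ($words as $word) {
    // An empty pattern with /u matches only when the subject is valid UTF-8.
    if (preg_match('//u', $word) !== 1) {
        throw new RuntimeException('Invalid UTF-8 in token: ' . bin2hex($word));
    }
}

// The word next to the closing guillemet should come back intact.
if (!in_array('guerre', $words, true)) {
    throw new RuntimeException('Expected "guerre" among the tokens');
}
```

Because PREG_SPLIT_DELIM_CAPTURE is set, the guillemets themselves may still be returned as separate tokens; filtering them out would be a job for the stopword list or a later step.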

crodas, Jan 24 '17 03:01

Any plans to push a new version?

fawzib, Jan 06 '18 05:01

@fawzib I think the change was pushed on the develop branch: https://github.com/crodas/TextRank/commit/073b9026050e8500257f8853bceab1aeb3827708

It would be nice to see this package released with a version number instead of just needing to require dev-master.

BenParizek, Jan 06 '18 06:01