php-text-analysis

Issue with German Umlauts using "PHP Rapid Automatic Keyword Extraction"

Open menturion opened this issue 10 months ago • 2 comments

Hi, many thanks for this amazing script!

I tested your "PHP Rapid Automatic Keyword Extraction" example (shown here: https://github.com/yooper/php-text-analysis/wiki/PHP-Rapid-Automatic-Keyword-Extraction) and noticed issues with special characters such as the German umlauts.

I tested it with the German stop word list ("stop-words_german_1_de.txt").

It listed [verst rkte] => 8 as a keyword/score (n-gram = 2), which should be [verstärkte] => 8. It seems to treat every word that contains a German umlaut as multiple words, replacing each umlaut with a space " ". In the example above, the result is "verst" and "rkte" instead of "verstärkte".

Is there any way to fix this? I tried converting the input text to UTF-8, but that had no effect on the issue.
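
One possible explanation, which I have not verified against the library's source, is a cleanup step that uses a byte-oriented regular expression. The sketch below uses only PHP's built-in `preg_replace`; the two helper functions are hypothetical and are not part of php-text-analysis. It shows how a pattern without the `u` modifier splits a word at the umlaut, and how a Unicode-aware pattern leaves it intact.

```php
<?php
// Hypothetical helpers for illustration only; not part of php-text-analysis.

// Byte-oriented cleanup: without the `u` modifier, PCRE sees the two UTF-8
// bytes of "ä" (0xC3 0xA4) as two non-alphanumeric bytes and replaces them.
function stripNonWordBytes(string $text): string
{
    return preg_replace('/[^a-z0-9\s]+/i', ' ', $text) ?? $text;
}

// Unicode-aware cleanup: the `u` modifier makes PCRE decode the subject as
// UTF-8, and \p{L} / \p{N} match letters and digits in any script.
// preg_replace returns null on invalid UTF-8, so fall back to the input.
function stripNonWordUnicode(string $text): string
{
    return preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $text) ?? $text;
}

// "verstärkte", written with an escape so the file encoding does not matter.
$word = "verst\u{00E4}rkte";

echo stripNonWordBytes($word), PHP_EOL;   // prints "verst rkte"
echo stripNonWordUnicode($word), PHP_EOL; // prints "verstärkte"
```

If the library's filters or tokenizer behave like the first function, the input is already valid UTF-8 and converting it again would not help, which would match what I observed.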

menturion, Dec 08 '24 16:12