php-text-analysis
Issue with German Umlauts using "PHP Rapid Automatic Keyword Extraction"
Hi, many thanks for this amazing script!
I tested your "PHP Rapid Automatic Keyword Extraction" example (shown here: https://github.com/yooper/php-text-analysis/wiki/PHP-Rapid-Automatic-Keyword-Extraction) and noticed that it has problems with special characters such as German umlauts.
I tested it with the German stop word list ("stop-words_german_1_de.txt").
With n-gram = 2, it listed [verst rkte] => 8 as a keyword/score pair, which should be [verstärkte] => 8. It seems to treat every word that contains a German umlaut as multiple words by replacing each umlaut with a space " ", so "verstärkte" becomes "verst" and "rkte".
Is there any way to fix this? I tried converting the input text to UTF-8, but that had no effect on the issue.
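
For illustration, the same symptom can be reproduced outside the library with a byte-oriented regular expression. This is only a sketch of one plausible cause, not code taken from php-text-analysis: the variable names and both patterns are hypothetical, and whether the library cleans its input this way is an assumption.

```php
<?php
// Hypothetical reproduction; none of this is taken from php-text-analysis.
// "\u{00E4}" is "ä", written as an escape so the result does not depend on
// the encoding of this source file (requires PHP 7 or later).
$text = "verst\u{00E4}rkte";

// Without the /u modifier, PCRE reads the UTF-8 string byte by byte.
// The two bytes that encode "ä" are not in [a-z], so each becomes a space.
$byteWise = preg_replace('/[^a-z ]/i', ' ', $text);
$byteWise = preg_replace('/\s+/', ' ', $byteWise); // collapse repeated spaces

// With /u and \p{L}, "ä" is matched as a single letter and is kept.
$unicodeAware = preg_replace('/[^\p{L} ]/u', ' ', $text);

echo $byteWise, PHP_EOL;     // the word is split in two at the umlaut
echo $unicodeAware, PHP_EOL; // the word stays intact
```

If something like the first pattern is involved, converting the input to UTF-8 would not help, because the input is already valid UTF-8 and the split happens during pattern matching.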