
wrong language detection

Open FLasH3r opened this issue 4 years ago • 3 comments

I have the following texts, each followed by the language this package detected for it. All of them are English, so only the entries detected as en are correct.

  • Announcing the GitHub Education Classroom Report 2020 - en
  • Highlights from Game Off 2020 - en
  • How to launch a tech career in 2021 - it
  • Let’s talk about securing open source projects - tl
  • Git clone: a data-driven study on cloning behaviors - tl
  • Get up to speed with partial clone and shallow clone - it
  • GitHub joins amicus brief warning of systemic risk from private sector offensive actors - af
  • Visualizing GitHub’s global community - tl
  • How we built the GitHub globe - en
  • How to make DevOps your competitive advantage - pt

Besides running composer install ... I haven't done anything.

The text here is just an example; it's from the GitHub blog (the titles of the last 10 posts).

If I do new \LanguageDetector\LanguageDetector(null, ['en']); it works, but that is not the goal.

The code looks like this:

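$languageDetector = new \LanguageDetector\LanguageDetector();

// $titles holds the ten blog post titles listed above
foreach ($titles as $title) {
    // Detect the most likely language for each title and print it next to the title
    $languages = $languageDetector->evaluate($title)->getLanguage();
    echo $title.' - '.(string)$languages.PHP_EOL;
}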
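To put a number on the results listed above, here is a minimal sketch of a tally helper. The function name countCorrect and the $reported array are hypothetical and not part of the package. The sketch replays the detections reported in this issue rather than calling the library, so it runs without any dependencies.

```php
<?php
// Hypothetical helper (not part of the package): counts how often a detector
// callback returns the expected language code for a list of short texts.
function countCorrect(array $texts, string $expected, callable $detect): array
{
    $correct = 0;
    foreach ($texts as $text) {
        if ($detect($text) === $expected) {
            $correct++;
        }
    }
    return ['correct' => $correct, 'total' => count($texts)];
}

// Detections exactly as reported in this issue (title => detected code).
$reported = [
    'Announcing the GitHub Education Classroom Report 2020' => 'en',
    'Highlights from Game Off 2020' => 'en',
    'How to launch a tech career in 2021' => 'it',
    'Let’s talk about securing open source projects' => 'tl',
    'Git clone: a data-driven study on cloning behaviors' => 'tl',
    'Get up to speed with partial clone and shallow clone' => 'it',
    'GitHub joins amicus brief warning of systemic risk from private sector offensive actors' => 'af',
    'Visualizing GitHub’s global community' => 'tl',
    'How we built the GitHub globe' => 'en',
    'How to make DevOps your competitive advantage' => 'pt',
];

// Stand-in detector that replays the reported output instead of calling the library.
$replay = fn (string $title): string => $reported[$title];

$result = countCorrect(array_keys($reported), 'en', $replay);
echo $result['correct'].'/'.$result['total'].PHP_EOL; // prints 3/10
```

To run the same tally against the package itself, the callback could wrap the call from the snippet above, for example fn (string $t): string => (string) $languageDetector->evaluate($t)->getLanguage(), which assumes the returned object casts to its language code as it does in the echo above.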

FLasH3r avatar Jan 03 '21 19:01 FLasH3r

Looks like this suffers from the same thing as the more popular https://github.com/patrickschur/language-detection

It does a good job with long texts but is borderline useless for short sentences, getting them wrong at an alarmingly high rate.

Still looking for a reliable language detector that works well with short sentences. If anyone finds one, please share.

vesper8 avatar May 02 '21 20:05 vesper8

ward

FabianoLothor avatar Jul 21 '21 09:07 FabianoLothor

> Still looking for a reliable language detector that works well with short sentences. If anyone finds one, please share.

@vesper8 https://github.com/fntlnz/cld2-php-ext works well for my use cases, even with rather short texts. It detects all of the above cases as English.

dmaicher avatar Dec 02 '21 12:12 dmaicher