Wrong language detection

Open danielnaber opened this issue 6 years ago • 13 comments

Issue to collect cases of incorrect language detection that occur even with fastText. There is nothing we can easily fix, but we should at least be aware of the issues:

  • Osteuropa -> fr
  • USA, Osteuropa -> sv

danielnaber avatar Oct 05 '18 13:10 danielnaber

Doesn't it make sense to require a minimum text length before another language is suggested? I just entered a name (Stephanie) and LT suggested switching to English. Proper names in particular may be valid in several languages (e.g., Maria is a common name in English, German, Italian, and maybe others).
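A minimal sketch of such a length gate, for illustration only. This is not LT's code: the function name, its parameters, and the thresholds are all hypothetical, and the thresholds are arbitrary.

```python
def should_suggest_language_switch(text, detected_lang, current_lang,
                                   min_words=3, min_chars=20):
    """Return True only if the text is long enough to trust the detection.

    All names and thresholds here are assumptions made for this sketch.
    """
    if detected_lang == current_lang:
        return False  # nothing to suggest
    words = text.split()
    # Too little evidence: a single name such as "Stephanie" keeps the
    # currently selected language.
    if len(words) < min_words or len(text.strip()) < min_chars:
        return False
    return True
```

With these thresholds, `should_suggest_language_switch("Stephanie", "en", "de")` returns `False`, so no switch would be suggested for a lone name.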

f-knorr avatar Oct 05 '18 16:10 f-knorr

Maybe, but I don't know where to draw the line. A text might even start with several names... When you say "LT suggests", what client are you referring to?

danielnaber avatar Oct 05 '18 17:10 danielnaber

image

f-knorr avatar Oct 05 '18 17:10 f-knorr

It might be wise to exclude any capitalized word (proper names mostly) as a quick hack.

ghost avatar Oct 09 '18 10:10 ghost

And perhaps exclude familial prepositions (e.g., Van den Wateren never got the hang of his paternal grandparents' tongue).
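A rough sketch of the two filtering ideas above (dropping capitalized words and dropping name particles before detection). The function name and the particle list are hypothetical and incomplete, and this is not how LT works. The hack also has obvious costs: it discards sentence-initial words and every German noun, and words such as "den" or "de" are ordinary words in some languages.

```python
import re

# Hypothetical, incomplete list of lowercase family-name particles.
NAME_PARTICLES = {"van", "den", "der", "de", "von", "di", "la"}

def words_for_detection(text):
    """Return the words that would still be passed to the language detector."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    kept = []
    for i, tok in enumerate(tokens):
        # Quick hack: treat any capitalized word as a possible proper name.
        if tok[0].isupper():
            continue
        # Drop a particle only when a capitalized word follows it.
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in NAME_PARTICLES and nxt[:1].isupper():
            continue
        kept.append(tok)
    return kept
```

For the examples from the first post, "USA, Osteuropa" leaves no words at all, so a caller could skip detection and keep the current language instead of guessing.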

SkyCharger001 avatar Oct 09 '18 16:10 SkyCharger001

English is detected as French on this LinkedIn post:

image

Aside: shouldn't the messages after the suggestions be in French?

MikeUnwalla avatar May 12 '21 08:05 MikeUnwalla

Could you send the full text as text (not just as a screenshot)?

danielnaber avatar May 12 '21 08:05 danielnaber

At Congrès Inforsid 2021 (https://inforsid2021.sciencesconf.org/resource/page/id/14) on June 1, Tuesday 2:00 - 5:30 pm. I will present an online workshop (in English) about ASD-STE100.

Controlled language for text simplification: Concepts and implementation

ABSTRACT. In commerce and industry, many organizations use plain language, for example, ‘plain English’ and ‘lenguaje claro’ [Spanish]. For safety-critical documentation, plain language is not always sufficient, and some organizations use controlled language. ASD-STE100 Simplified Technical English is a specification for a controlled language. In this paper, we present the TechScribe term checker for ASD-STE100, which checks a document for conformity to ASD-STE100. Many of the ASD-STE100 rules are applicable to the simplification of scientific texts. To show that, this paper conforms to ASD-STE100 as much as possible.

MikeUnwalla avatar May 12 '21 08:05 MikeUnwalla

Sorry Daniel, I should have thought to send the full text.

MikeUnwalla avatar May 12 '21 08:05 MikeUnwalla

If you add the English text directly to a post, there is no problem.

To reproduce the problem, add English text and French text in the same post: image

The French text made the post too long. After I deleted the French text, LT continued to give the warnings.

French text: RÉSUMÉ. Dans le commerce et l'industrie, de nombreuses organisations utilisent des langues simplifiées, par exemple "plain English" et "lenguaje claro" [éspagnol]. Pour la documentation technique, critique pour la sécurité, "plain language" n'est pas toujours suffisant et certaines organisations utilisent une langue contrôlée. L'ASD-STE100 Simplified Technical English est la spécification d’une langue contrôlée. Dans cet article, nous présentons TechScribe, un logiciel conçu pour vérifier automatiquement la conformité d'un document aux règles ASD-STE100. De nombreuses règles de l'ASD-STE100 sont applicables à la simplification des textes scientifiques. Pour le démontrer, dans la mesure du possible, cet article est conforme à la norme ASD-STE100.

MikeUnwalla avatar May 13 '21 08:05 MikeUnwalla

Thanks, I can reproduce it now. It seems to be add-on-related; I have opened an issue there (https://github.com/languagetooler-gmbh/browser-add-on-rewrite/issues/1263).

danielnaber avatar May 13 '21 09:05 danielnaber

  1. Start with English (American) and 'Automatically detect language' not selected.
  2. Delete all the text.
  3. Select 'Automatically detect language'.
  4. End with Australian English: image

MikeUnwalla avatar Jul 11 '22 12:07 MikeUnwalla

And another: image

Click 'Automatically detect language' and LT detects Romanian: image

MikeUnwalla avatar Jul 11 '22 12:07 MikeUnwalla