languagetool
`*_WORD_REPEAT_BEGINNING_RULE` fails for common text line length
Symptom: I observe strange behaviour of `<LANG>_WORD_REPEAT_BEGINNING_RULE`. In plain text files, whether the rule matches depends on line length. I can reproduce the behaviour for at least English and German.
The attached file contains two paragraphs of the same German text wrapped to different line lengths. Each paragraph consists of four sentences, all starting with the same word. (Die is just a German article, btw.) The word repetition rule matches only in the second paragraph, which has the longer line length. (Tests show that the order of the paragraphs doesn't matter.) Even though the text is in German, checking with a language code such as `en-GB` shows the same behaviour.
To reproduce, download the attached text file and run the LT command-line checker like this:
$ java -jar /path/to/languagetool-commandline.jar -l de-DE die-bahn.txt
Expected text language: German (Germany)
Working on die-bahn.txt...
1.) Line 13, column 107, Rule ID: GERMAN_WORD_REPEAT_BEGINNING_RULE premium: false prio=-61
Message: Drei aufeinanderfolgende Sätze beginnen mit dem gleichen Wort. Evtl. können Sie den Satz umformulieren, zum Beispiel, indem Sie ein Synonym nutzen.
...zeichnete den Vorschlag als nicht annehmbar. Die Kernforderung der GDL in der Tarifauseinande...
^^^
Time: 16722ms for 8 sentences (0.5 sentences/sec)
I can reproduce the same behaviour with a local LT server installation and the Emacs text editor.
Expected behaviour: The repetition error is recognized in both paragraphs, since the text is essentially the same.
System:
- Xubuntu 20.04
- LanguageTool v6.4-snapshot, 2024-03-12
- LanguageTool v6.4-snapshot, 2024-02-28
- OpenJDK 11.0.22
- Emacs 26.3, langtool.el v2.3.7
Attached file: die-bahn.txt
Further tests show that the minimum line length at which the repetition error is caught is somewhere around 115 characters.
The following loop wraps the file to different line lengths, then sends it to the LT server and greps the result for the string `RULE`:
$ for i in $(seq 111 120); do echo $i; TEXT=$(fmt -w $i die-bahn.txt); curl -s --data "language=de-DE&text=$TEXT" http://localhost:8081/v2/check | jq | grep RULE; done
111
112
113
114
115
116
117
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
118
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
119
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
120
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
"id": "GERMAN_WORD_REPEAT_BEGINNING_RULE",
Two matches, because the file attached to the original report contains the same paragraph twice.
A similar call of the command-line tool (which runs much slower) would look like this:
$ for i in $(seq 111 120); do echo $i; fmt -w $i die-bahn.txt | java -jar /opt/LanguageTool/languagetool-commandline.jar -l de-DE --json 2> /dev/null | jq | grep RULE; done
That is, the error is signaled for a maximum line length of 117 characters and above. Inspecting the actual maximum line length of the wrapped text, which is not necessarily equal to the `fmt -w` argument, via
$ fmt -w 117 die-bahn.txt | wc -L
116
reveals that the longest line in fact contains 116 characters. When the text is wrapped with a width argument of 116, where the repetition error wasn't caught, the maximum line length drops to 111 characters. So the limit at which the repetition rule starts working seems to lie somewhere between 112 and 116 characters.
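The sweep above can also be scripted so that each wrap width is reported together with the actual longest line and the rule IDs returned. The following is only a sketch, assuming the same local server URL as in the curl loop and the `matches[].rule.id` layout of the `/v2/check` JSON response; the helper names (`longest_line`, `encode_form`, `rule_ids`, `sweep`) are made up for illustration. Unlike the plain `--data` call, it URL-encodes the text, so characters such as `&` or `+` in the input cannot distort the request.

```python
import json
import subprocess
import urllib.parse
import urllib.request

# Assumption: a local LanguageTool server, as in the curl loop above.
CHECK_URL = "http://localhost:8081/v2/check"


def longest_line(text):
    """Length of the longest line in characters (comparable to `wc -L`
    for text without tabs or double-width characters)."""
    return max((len(line) for line in text.splitlines()), default=0)


def encode_form(language, text):
    """URL-encode the form fields so '&', '+', '%' and line breaks survive."""
    fields = {"language": language, "text": text}
    return urllib.parse.urlencode(fields).encode("utf-8")


def rule_ids(response):
    """Collect the rule IDs from a parsed /v2/check JSON response."""
    return [match["rule"]["id"] for match in response.get("matches", [])]


def sweep(path, widths, language="de-DE"):
    """Wrap the file with fmt at each width; return (width, longest line, rule IDs)."""
    rows = []
    for width in widths:
        wrapped = subprocess.run(
            ["fmt", "-w", str(width), path],
            check=True, capture_output=True, text=True,
        ).stdout
        # Passing data makes this a POST with form-encoded content.
        request = urllib.request.Request(CHECK_URL, data=encode_form(language, wrapped))
        with urllib.request.urlopen(request) as reply:
            rows.append((width, longest_line(wrapped), rule_ids(json.load(reply))))
    return rows
```

With the server running, `for row in sweep("die-bahn.txt", range(111, 121)): print(*row)` would print one line per width, which makes it easier to see where the `fmt -w` argument and the real longest line diverge.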
Referring to the edited title, a common line length in a text editor is somewhere in the range of 60 to 80 characters per line. `*_WORD_REPEAT_BEGINNING_RULE` fails for people working with such a setup.
The problem affects Thunderbird as well. I copied the text from the file attached to the original report, opened a new mail in Thunderbird and pasted the text there. It is then automatically wrapped to some standard line length, but LanguageTool was only able to catch the repetition in the second paragraph (with the longer source line length):
[Screenshot: Xubuntu 20.04, Thunderbird 115.8.1, LanguageTool-Addon 8.3.0]
For what it's worth, meanwhile, I have been able to reproduce the problem in the Firefox add-on, too. Here's how:
- Open https://pastebin.com/.
- Copy all text from the file attached to the original report into the input field labelled "New Paste".
- Wait for the LT add-on to do the checking.
The result should look like this:
[Screenshot: Xubuntu 20.04, Firefox 124.0 (deb), LanguageTool add-on 8.6.0]
I'm not too enthusiastic about giving back (a little) via bug reports anymore, given the phenomenal feedback rate visible here and on the forum for issues other than plain word error suggestions. How easy can a bug be to reproduce? No interaction with a third-party application is necessary. Really, I didn't expect this issue to stay open for more than a week or so. (And no, this is not meant to be demanding. What I'm questioning is giving no feedback at all, or only the tersest possible feedback by default. I know you make money with this code and that's OK. But keep in mind that apparently non-paying users who give technical feedback may be staff who put their thumbs up or down before installing the software on individual users' computers, whether paying or non-paying ones.)
Anyway, I can confirm this bug to affect the stand-alone LT application, too. As before, the text file can be found attached to the first comment.
Ubuntu 22.04, LanguageTool 6.4, OpenJDK 11
In general, the end-of-line character creates a new sentence for LanguageTool. Considering this, the rule matches as expected when there are three sentences starting with the same word. In other words, LanguageTool doesn't work well with hard-coded newlines.
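To make this explanation concrete, here is a toy model of it. This is not LanguageTool's actual segmentation or rule code; `split_sentences` and `repeated_beginnings` are hypothetical helpers, and the English sample text is made up. The splitter breaks after `.`, `!` or `?` and, optionally, at every line break; the checker flags a sentence when it and the two before it start with the same word.

```python
import re


def split_sentences(text, newline_is_boundary):
    """Rough splitter: break after . ! ? followed by whitespace and,
    if requested, also at every line break."""
    if newline_is_boundary:
        lines = text.splitlines()
    else:
        # Treat line breaks as ordinary spaces.
        lines = [" ".join(text.split())]
    parts = []
    for line in lines:
        parts.extend(re.split(r"(?<=[.!?])\s+", line.strip()))
    return [part for part in parts if part]


def repeated_beginnings(sentences, run=3):
    """Indices of sentences that complete a run of `run` equal first words."""
    hits = []
    for i in range(run - 1, len(sentences)):
        window = sentences[i - run + 1 : i + 1]
        if len({s.split()[0].lower() for s in window}) == 1:
            hits.append(i)
    return hits


# Made-up sample: the same four sentences on one long line and hard-wrapped.
wide = ("The train was late. The driver went on strike. "
        "The union rejected the offer. The talks will resume next week.")
narrow = ("The train was late. The driver\n"
          "went on strike. The union\n"
          "rejected the offer. The talks\n"
          "will resume next week.")

print(repeated_beginnings(split_sentences(wide, True)))     # [2, 3]
print(repeated_beginnings(split_sentences(narrow, True)))   # []
print(repeated_beginnings(split_sentences(narrow, False)))  # [2, 3]
```

In this model the hard-wrapped text produces fragments such as "went on strike." that interrupt the run of identical first words, so the repetition is only found when lines are long enough or when line breaks are not treated as sentence boundaries.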
Thank you for giving this insight. That could explain the issue. If sentences are indeed broken at line breaks, that would render grammar rules largely pointless. Will watch that.
On the other hand, I don't think this is the whole truth: given the behaviour you described, shouldn't lines starting with a lowercase word trigger the UPPERCASE_SENTENCE_START rule, e.g., on line 3 or 4 in the last screenshot?