textidote No warnings found in CI

There seems to be an issue where no warnings are found when using the tool in CI. Running java -jar textidote.jar --check de --output html PuE.tex > language_report.html locally in the root of the repo works and warnings are found. In the CI log, it does not state that the file was skipped, but no warnings are found either. Explicitly using the correct file name "PuE.tex" in the CI script instead of the variable has no effect.

gitlab-ci.txt

Aug 15 '21 19:08 giulianorasper

Could it be a file encoding problem? I recall this issue from last year:

https://github.com/sylvainhalle/textidote/issues/120#issuecomment-613433539

If the encoding of the file does not match what TeXtidote expects, nothing is being read and that would explain the absence of warnings.
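
A quick way to rule this in or out is to test whether the file's bytes decode cleanly as UTF-8. The following is only a minimal sketch in Python; the helper name is_valid_utf8 is hypothetical and not part of TeXtidote, and the check says nothing about which encoding TeXtidote itself assumes.

```python
def is_valid_utf8(path: str) -> bool:
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # At least one byte sequence is not valid UTF-8.
        return False
    return True
```

Calling is_valid_utf8("PuE.tex") both locally and in the CI job would show whether the two environments see a UTF-8 file at all.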

Aug 15 '21 23:08 sylvainhalle

To debug the issue, may I suggest you try this command:

java -jar textidote.jar --clean --check de --output html PuE.tex > cleaned.txt

If cleaned.txt is empty, then we'd have a hint about what is going on.

Aug 15 '21 23:08 sylvainhalle

I just created a minimal example for this issue in this repository. The pipeline of that repository generated one set of artifacts, and a second artifacts.zip was generated on my local system (Ubuntu in WSL). Notably, in the cleaned.txt generated by GitLab CI, the characters 'ü' and 'ß' have been replaced with '?', whereas this did not happen in the locally generated version.

Aug 16 '21 10:08 giulianorasper

Thanks for providing these artifacts. I opened the files in a hex editor to see how the characters have been encoded. Here is what I found:

Source file (main.tex):

ü: C3BC -> UTF-8
ß: C39F -> UTF-8

CI (cleaned.txt):

ü: 3F -> "?"
ß: 3F -> "?"

Local (cleaned.txt):

ü: FC -> latin-1
ß: DF -> latin-1

I am a bit puzzled by what I see. The source file is a valid UTF-8 document. When processed locally, it ends up as a file transcoded into latin-1 (visible by the fact that the two characters end up with a different hex value). I don't know how this is possible, as TeXtidote always assumes the default encoding of the OS it runs in. Finally, when it is run in the CI pipeline, the characters are garbled, indicating again that the program does not assume UTF-8 as the input encoding. However, looking at your CI configuration, I see that you use a Debian OS, so UTF-8 input should not be a problem.
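
The three byte patterns listed above can be reproduced with a short Python sketch. It only illustrates how the same two characters look under different output encodings. That the CI output came from an ASCII-like charset that cannot represent 'ü' and 'ß' is an assumption, not something confirmed in this thread.

```python
# Illustrative only: reproduces the byte values reported from the hex editor.
SAMPLE = "\u00fc\u00df"  # the characters 'ü' and 'ß'


def hex_bytes(text: str, encoding: str, errors: str = "strict") -> str:
    """Return the encoded bytes of `text` as an uppercase hex string."""
    return text.encode(encoding, errors=errors).hex().upper()


utf8_hex = hex_bytes(SAMPLE, "utf-8")      # as stored in main.tex
latin1_hex = hex_bytes(SAMPLE, "latin-1")  # as seen in the local cleaned.txt
# Assumption: an encoder that cannot represent the characters substitutes '?'.
ascii_hex = hex_bytes(SAMPLE, "ascii", errors="replace")

print(utf8_hex, latin1_hex, ascii_hex)
```

The point of the sketch is that a '?' (3F) cannot be turned back into the original character, so whatever reads the CI output afterwards has already lost the umlauts.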

A workaround for your problem would be to explicitly tell TeXtidote to use UTF-8, by adding the --encoding UTF-8 command line switch when you call it. Tell me if this changes something.

Aug 19 '21 12:08 sylvainhalle

Thanks for the help so far! As suggested, I added the --encoding UTF-8 parameter in the CI script. However, this did not affect the resulting CI artifacts.

To confirm that the main.tex is not altered by Git in some unexpected way when pushing / pulling, I also tried downloading my local main.tex version as part of the pipeline on another branch which yielded the same results.
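
Another way to make the same check, independent of any download step, is to compare a checksum of the file in both environments. This is a minimal sketch; the helper name file_sha256 is hypothetical, and it only shows whether the bytes are identical, not how any tool interprets them.

```python
import hashlib


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of the file's raw bytes (hypothetical helper)."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()
```

If file_sha256("main.tex") prints the same digest locally and in the CI job, the input file is byte-identical and the difference must come from how it is read or written afterwards.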

Aug 19 '21 13:08 giulianorasper

This may not be related, but I see that the calls to TeXtidote mix the --clean option with the --check option. These two are mutually exclusive: calling --clean only cleans the document and exits before performing any other verification.

Aug 19 '21 15:08 sylvainhalle

@giulianorasper Did you find a solution?

Jul 08 '22 22:07 ComanderKai77

Will close this due to lack of information to fix the issue.

Mar 08 '23 21:03 sylvainhalle
