“encoding” Problem
The version of latexdiff is:
This is LATEXDIFF 1.3.0 (Algorithm::Diff 1.15 so, Perl v5.28.1)
(c) 2004-2018 F J Tilmann
Preamble Internal Type UNDERLINE
Preamble Internal Type SAFE
Preamble Internal Type FLOATSAFE
I am working on Windows 10, version 1909.
When I run latexdiff with a command such as "latexdiff old.tex new.tex > diff.tex" or "latexdiff --encoding=utf8 old.tex new.tex > diff.tex", the resulting "diff.tex" is encoded in UTF-16 LE, even though "old.tex" and "new.tex" are both encoded in UTF-8. Non-ASCII characters, such as Chinese and Japanese text, come out garbled.
For example, "old.tex"
\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个测试文档。
\end{document}
"new.tex"
\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个新的测试文档。
\end{document}
"diff.tex"
\documentclass{article}
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL old.tex Sat Apr 4 22:12:08 2020
%DIF ADD new.tex Sat Apr 4 22:12:03 2020
\usepackage[UTF8]{ctex}
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFadd}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{#1}}} %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
%DIF LISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
language=DIFcode, %DIF PREAMBLE
basicstyle=\ttfamily, %DIF PREAMBLE
columns=fullflexible, %DIF PREAMBLE
keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF
\begin{document}
浣犲ソ锛孿DIFdelbegin \DIFdel{杩欐槸涓€涓祴璇曟枃妗c€?
}\DIFdelend \DIFaddbegin \DIFadd{杩欐槸涓€涓柊鐨勬祴璇曟枃妗c€?
}\DIFaddend\end{document}
I found that if old.tex and new.tex are encoded as UTF-8 with BOM, diff.tex contains the correct characters. It is still encoded in UTF-16, but it can then easily be re-encoded to UTF-8.
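For readers who prefer to script the re-encoding step rather than use an editor, here is a minimal Python sketch. The function name `reencode_utf16_to_utf8` and the file names are hypothetical, and it assumes the characters in the UTF-16 file are intact (as in the BOM case described above); it cannot repair text that was already garbled.

```python
from pathlib import Path


def reencode_utf16_to_utf8(src: str, dst: str) -> None:
    """Rewrite a UTF-16 text file as UTF-8 without a BOM."""
    # The "utf-16" codec reads the byte order from the leading BOM
    # (FF FE for little-endian) and drops the BOM from the decoded text.
    text = Path(src).read_bytes().decode("utf-16")
    # Work on bytes throughout so that line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
```

A call such as `reencode_utf16_to_utf8("diff.tex", "diff-utf8.tex")` would write a UTF-8 copy next to the original.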
So is it solved? What is BOM?
The UTF-8 BOM is a sequence of bytes (0xEF, 0xBB, 0xBF) at the start of a text stream that allows a reader to more reliably detect the file as UTF-8. Ref: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
I re-encoded the files with VS Code's "Save with Encoding" function.
I suspect there is something wrong with the variable $encoding, but I don't know Perl.
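To make the BOM explanation concrete, the following Python sketch checks for and prepends the three BOM bytes. The helper names `has_utf8_bom` and `add_utf8_bom` are made up for illustration; `codecs.BOM_UTF8` is the standard-library constant for the bytes 0xEF, 0xBB, 0xBF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless it is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data
```

Reading a .tex file with `open(path, "rb").read()` and passing the bytes through `add_utf8_bom` should give roughly the same result as saving it as "UTF-8 with BOM" in an editor.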
Thanks for this report. The encoding is mostly dealt with by perl and (as you could see from my question) I have no real insight into the encoding. So I will not tackle this anytime soon but will leave the issue open in case anyone has an insight.
I have just encountered this issue. On Windows, use the good old CMD or PowerShell 6.2+, because the default PowerShell in Windows 10/11 writes the output file in UTF-16 when you use >. Sometimes re-encoding to UTF-8 is not enough: a character like é in a .tex file turns into the gibberish ├⌐ when latexdiff is run in PowerShell <6.2, and it cannot be recovered even by re-encoding to UTF-8. I would say nothing is wrong with latexdiff or Perl.
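The garbling can be reproduced outside PowerShell. The Python sketch below encodes text as UTF-8 and then decodes the bytes with a legacy code page, which seems to be what happens in the pipeline. The code pages are assumptions inferred from the garbled output in this thread (437 for the é example, 936/GBK for the Chinese example); nothing here was read from the reporters' systems, and `misdecode` is a hypothetical helper name.

```python
def misdecode(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


# é is C3 A9 in UTF-8; code page 437 maps those bytes to two box/symbol glyphs.
# misdecode("é", "cp437") == "├⌐"

# 你好 is E4 BD A0 E5 A5 BD in UTF-8; GBK pairs the bytes up differently.
# misdecode("你好", "gbk") == "浣犲ソ"
```

Both results match the garbled text shown in this thread, which supports the view that the shell, not latexdiff, re-decodes the output.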
Edit: The command below works, but also breaks UTF-8 characters. I will stick with cmd and consider adding this to the FAQ.
You can use the following in PowerShell to get a UTF-8 output file, but it will still break when there are non-standard characters in the .tex files.
latexdiff a.tex b.tex | Out-File output.tex -Encoding utf8
Edit
The bigger issue seems to be that PowerShell does not use Unicode when piping the output of one command into another, see https://markw.dev/unicode_powershell/. I was able to get latexdiff to work in PowerShell using the following:
> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
> latexdiff .\latex_test_files\utf8_a.tex .\latex_test_files\utf8_b.tex | Out-File -Encoding utf8 out.tex
I would still recommend using cmd instead, and I will work on the pull request now.
Original text
Addendum: It appears that this is a known problem with Perl in general under Windows.
See e.g. https://stackoverflow.com/a/66281302 and https://github.com/StrawberryPerl/Perl-Dist-Strawberry/issues/18.
See also https://stackoverflow.com/q/4942305; many other languages like Python and Node.js have since solved this issue.
I messed around a bit in Perl, tried some things, but it seems like there is no working pure-Perl solution. It seems like the Perl developers cannot easily change this, either, as it will break legacy code.
Solution for now
It seems to be best to just use cmd under Windows. Maybe I'll create a pull request to update the documentation.
Future
I have two ideas how one could mitigate this problem:
- One could implement direct output to files, like latexdiff --outfile=out.tex a.tex b.tex. I suspect this will be quite a bit of work to implement, though.
- Another (hypothetical) possibility is to modify the latexdiff.exe wrapper to fix the output. Not sure how complicated that will be.
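As a rough illustration of the wrapper idea, an external script can write the child process's raw stdout bytes straight to a file, so that no shell redirection or pipe re-decoding is involved. This is only a sketch in Python, not part of latexdiff: `run_to_file` is a hypothetical name, it assumes a `latexdiff` executable is on the PATH, and it has not been tested against latexdiff itself.

```python
import subprocess
from typing import Sequence


def run_to_file(command: Sequence[str], out_path: str) -> int:
    """Run command and write its raw stdout bytes to out_path."""
    # The file is opened in binary mode and handed to the child process,
    # so neither a shell nor a text decoder sits between the two.
    with open(out_path, "wb") as out:
        return subprocess.run(list(command), stdout=out).returncode
```

Usage would look like `run_to_file(["latexdiff", "old.tex", "new.tex"], "diff.tex")`; the return value is the exit code of the command.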
xref: https://tex.stackexchange.com/questions/542161/error-in-texstudio-when-using-latexdiff-on-windows-10#comment1652779_542161