
“encoding” Problem

Open Key033 opened this issue 5 years ago • 5 comments

The version of latexdiff is

This is LATEXDIFF 1.3.0 (Algorithm::Diff 1.15 so, Perl v5.28.1) (c) 2004-2018 F J Tilmann Preamble Internal Type UNDERLINE Preamble Internal Type SAFE Preamble Internal Type FLOATSAFE

I am working on Windows 10, version 1909.

When I run latexdiff with a command like "latexdiff old.tex new.tex > diff.tex" or "latexdiff --encoding=utf8 old.tex new.tex > diff.tex", "diff.tex" is encoded in UTF-16 LE, although "old.tex" and "new.tex" are encoded in UTF-8. UTF-8 characters such as Chinese and Japanese are garbled.

For example, "old.tex"

\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个测试文档。
\end{document}

"new.tex"

\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个新的测试文档。
\end{document}

"diff.tex"

\documentclass{article}
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL old.tex   Sat Apr  4 22:12:08 2020
%DIF ADD new.tex   Sat Apr  4 22:12:03 2020
\usepackage[UTF8]{ctex}
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFadd}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{#1}}}                      %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
%DIF LISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
  moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
  moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
	language=DIFcode, %DIF PREAMBLE
	basicstyle=\ttfamily, %DIF PREAMBLE
	columns=fullflexible, %DIF PREAMBLE
	keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF

\begin{document}
浣犲ソ锛孿DIFdelbegin \DIFdel{杩欐槸涓€涓祴璇曟枃妗c€?
 }\DIFdelend \DIFaddbegin \DIFadd{杩欐槸涓€涓柊鐨勬祴璇曟枃妗c€?
 }\DIFaddend\end{document}
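
The garbled body above is consistent with UTF-8 bytes being decoded with a legacy Chinese code page somewhere between latexdiff and the output file. The following is a minimal Python sketch of that effect, not a diagnosis: the choice of GBK (cp936) is an assumption inferred from the garbled characters, and nothing in this report confirms which code page was active.

```python
# Assumption: something in the pipeline decodes latexdiff's UTF-8 output as GBK (cp936).
original = "你好"
utf8_bytes = original.encode("utf-8")      # six bytes: e4 bd a0 e5 a5 bd
garbled = utf8_bytes.decode("gbk")         # the same bytes misread as three GBK characters

# The result matches the first three characters of the garbled diff.tex body.
print(garbled == "浣犲ソ")

# While every byte pair still maps to a valid GBK character, the damage is reversible.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered == original)
```

Once a byte sequence has no valid mapping and is replaced (for example by "?"), this round trip no longer works, which would explain the stray "?" at the end of the garbled lines.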

Key033 avatar Apr 04 '20 14:04 Key033

I found that if old.tex and new.tex are encoded as UTF-8 with BOM, diff.tex contains the correct UTF-8 characters. It is still encoded as UTF-16, but it can easily be re-encoded to UTF-8.
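
One way to do that re-encoding is sketched below in Python. The function name and file names are hypothetical, and the sketch only changes the encoding of a file whose characters are already correct; it cannot repair text that was garbled earlier.

```python
from pathlib import Path


def reencode_utf16_to_utf8(src: Path, dst: Path) -> None:
    """Decode src as UTF-16 (the BOM selects the byte order) and write it to dst as UTF-8."""
    # Work on bytes so that line endings are carried over unchanged.
    text = src.read_bytes().decode("utf-16")
    dst.write_bytes(text.encode("utf-8"))


# Example call with hypothetical file names:
# reencode_utf16_to_utf8(Path("diff.tex"), Path("diff_utf8.tex"))
```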

Key033 avatar Apr 04 '20 15:04 Key033

So is it solved? What is BOM?

ftilmann avatar Apr 04 '20 15:04 ftilmann

> So is it solved? What is BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that lets a reader detect more reliably that a file is encoded in UTF-8. Ref: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
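
As an illustration, here is a short Python sketch that checks for those three bytes. The helper name is made up for this example; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


without_bom = "你好".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(has_utf8_bom(with_bom))      # True
print(has_utf8_bom(without_bom))   # False

# The "utf-8-sig" codec drops a leading BOM if there is one, so both decode to the same text.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))   # True
```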

I re-encoded the files with VS Code's "Save with Encoding" function.

I think something is wrong with the variable $encoding, but I don't know Perl.

Key033 avatar Apr 04 '20 15:04 Key033

Thanks for this report. The encoding is mostly handled by Perl and (as you could see from my question) I have no real insight into it. So I will not tackle this anytime soon, but I will leave the issue open in case anyone has an insight.

ftilmann avatar May 23 '20 13:05 ftilmann

I have just encountered this issue. You should use the good old CMD on Windows, or PowerShell 6.2+, because the default PowerShell in Windows 10/11 writes the output file as UTF-16 when you use >. Sometimes it is not as simple as re-encoding to UTF-8: a character like é in a .tex file turns into gibberish (├⌐) when latexdiff is run in PowerShell <6.2, and it cannot be recovered just by re-encoding to UTF-8. I would say nothing is wrong with latexdiff or Perl.
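
The ├⌐ pattern is what the two UTF-8 bytes of é look like when read with the OEM code page 437. A small Python sketch of that effect follows; that the console was actually using code page 437 is an assumption, and other locales would produce different gibberish.

```python
# Assumption: the console decodes program output with code page 437.
correct_bytes = "é".encode("utf-8")        # two bytes: c3 a9
garbled = correct_bytes.decode("cp437")    # each byte becomes its own character

print(garbled == "├⌐")                     # True

# Saving the garbled text as UTF-8 stores the two wrong characters correctly,
# so the file is valid UTF-8 but no longer contains é.
reencoded = garbled.encode("utf-8")
print(reencoded == correct_bytes)          # False
```

This is why converting the UTF-16 output to UTF-8 afterwards does not help: the conversion faithfully preserves the wrong characters.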

henrysky avatar Aug 03 '22 02:08 henrysky

Edit: The command below works, but it also breaks UTF-8 characters. I will stick with cmd and consider adding this to the FAQ.

You can use the following in PowerShell to get a UTF-8 output file, but it will still break when there are non-ASCII characters in the .tex files.

latexdiff a.tex b.tex | Out-File output.tex -Encoding utf8

jonschz avatar Nov 05 '22 09:11 jonschz

Edit

The bigger issue seems to be that PowerShell does not use Unicode when piping the output of one command into another, see https://markw.dev/unicode_powershell/. I was able to get latexdiff to work in PowerShell using the following:

> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
> latexdiff .\latex_test_files\utf8_a.tex .\latex_test_files\utf8_b.tex | Out-File -Encoding utf8 out.tex

I would still recommend using cmd instead, and I will work on the pull request now.
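
Another way to keep the shell out of the data path is to let a small script start latexdiff and hand it the output file directly. The Python sketch below is only an illustration under assumptions: it assumes a `latexdiff` executable can be found on PATH, the helper name and file names are hypothetical, and it has not been tested against latexdiff on Windows.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_to_file(cmd: Sequence[str], out_path: Path) -> int:
    """Run cmd and write its stdout bytes to out_path unchanged; return the exit code."""
    with open(out_path, "wb") as out:
        # The child process writes straight to the file handle, so no shell
        # decodes or re-encodes the bytes on the way.
        result = subprocess.run(cmd, stdout=out)
    return result.returncode


# Hypothetical usage, assuming latexdiff is on PATH:
# run_to_file(["latexdiff", "old.tex", "new.tex"], Path("diff.tex"))
```

This sidesteps PowerShell's pipeline handling in the same way that redirection in cmd does.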

Original text

Addendum: It appears that this is a known problem with Perl in general under Windows.

See e.g. https://stackoverflow.com/a/66281302 and https://github.com/StrawberryPerl/Perl-Dist-Strawberry/issues/18.

See also https://stackoverflow.com/q/4942305; many other languages like Python and Node.js have since solved this issue.

I messed around a bit in Perl, tried some things, but it seems like there is no working pure-Perl solution. It seems like the Perl developers cannot easily change this, either, as it will break legacy code.

Solution for now

It seems to be best to just use cmd under Windows. Maybe I'll create a pull request to update the documentation.

Future

I have two ideas how one could mitigate this problem:

  1. One could implement direct output to files like latexdiff --outfile=out.tex a.tex b.tex. I suspect this will be quite a bit of work to implement, though.
  2. Another (hypothetical) possibility is to modify the latexdiff.exe wrapper to fix the output. Not sure how complicated that will be.

jonschz avatar Nov 05 '22 17:11 jonschz

xref: https://tex.stackexchange.com/questions/542161/error-in-texstudio-when-using-latexdiff-on-windows-10#comment1652779_542161

sgbaird avatar Nov 10 '22 01:11 sgbaird