error converting UTF-8 accented char to djot

Open massifrg opened this issue 1 year ago • 4 comments

I get this error with the latest version of pandoc (3.1.12.2):

pandoc: Cannot decode byte '\xc3': Data.Text.Encoding: Invalid UTF-8 stream

when I try to convert a file to the djot format:

pandoc -f html -t djot test.html 

where test.html is a UTF-8 encoded file with content like this:

<h2>test di convertibilità</h2>

What causes the error is the "à" character.

This was introduced in the latest patch of pandoc: version 3.1.12.1 works without errors (I'm testing the amd64 binary in Debian bookworm).

massifrg avatar Mar 03 '24 17:03 massifrg

I can't really help unless you can upload a (minimal) file to test with.

jgm avatar Mar 03 '24 18:03 jgm

Here it is (I had to zip it, because I can't upload HTML): test.zip

massifrg avatar Mar 03 '24 18:03 massifrg

Here's a more minimal case:

% pandoc -t djot -f native
Header 2 ("",[],[]) [Str "\224"]
^D
pandoc: Cannot decode byte '\xc3': Data.Text.Encoding: Invalid UTF-8 stream

jgm avatar Mar 04 '24 03:03 jgm
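For readers following along: `"\224"` in the native input is Haskell's decimal escape for U+00E0 ("à"), and its UTF-8 encoding starts with the byte `0xC3` named in the error message. The Python sketch below is only an illustration of that relationship and of how a decoder rejects a lone lead byte; it is not pandoc or djoths code, and the helper `try_decode` is a hypothetical name introduced here.

```python
# Illustration only: not pandoc/djoths code.
# U+00E0 is "à"; Haskell's "\224" is the same code point written in decimal.
ch = "\u00e0"
code_point = ord(ch)          # 224

# UTF-8 encodes this code point as two bytes: a lead byte and a continuation byte.
encoded = ch.encode("utf-8")  # b'\xc3\xa0'


def try_decode(data: bytes):
    """Hypothetical helper: return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The full two-byte sequence decodes fine.
whole = try_decode(encoded)

# A lone lead byte 0xC3 is an incomplete sequence and is rejected.
lead_only = try_decode(encoded[:1])

# Treating each byte as its own unit (an assumed failure mode, for illustration)
# leaves no piece that is valid UTF-8 on its own.
per_byte = [try_decode(bytes([b])) for b in encoded]

print(code_point, encoded, whole, lead_only, per_byte)
```

Any code that handles the encoded bytes one at a time, rather than as whole characters, could run into this kind of decode failure on non-ASCII input, while plain ASCII text would be unaffected.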

Looking at the changelog for jgm/djoths, I suspect this is due to https://github.com/jgm/djoths/issues/1

* Djot.Blocks: use ByteString directly in `toIdentifier` (#1,
  Vaibhav Sagar).

EDIT: I see the issue https://github.com/jgm/djoths/pull/1/files#r1510547477

jgm avatar Mar 04 '24 03:03 jgm