Problem using CSV containing UTF-8 special characters

Open kidahl opened this issue 8 years ago • 9 comments

My CSV is saved in UTF-8 and contains special Norwegian characters (æøå). These end up completely garbled in the generated docx file. The same characters in the .md file itself (the file that includes the .csv table) are rendered fine in the same docx file.

This is reproduced with other accented characters as well, so it seems to be a codepage conversion problem.

Using pandoc version 1.17.0.4 and latest pandoc-placetable.

kidahl avatar Sep 28 '17 12:09 kidahl

For the character 'Ø' I get '├ÿ' in the docx. In codepage 437 (the Windows console codepage) those two characters are the bytes 0xC3 0x98, which is exactly the UTF-8 encoding of 'Ø'.
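The byte-level claim above can be checked with a short Python sketch. The helper name `mojibake` is made up for illustration; it only simulates UTF-8 bytes being misread as codepage 437 and says nothing about where in the pandoc toolchain that misreading would happen.

```python
def mojibake(text: str, wrong_codec: str = "cp437") -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes with another codec."""
    return text.encode("utf-8").decode(wrong_codec)


# UTF-8 encoding of 'Ø' (U+00D8) is the two bytes C3 98.
print("Ø".encode("utf-8").hex())  # c398

# Read as codepage 437, C3 is '├' (U+251C) and 98 is 'ÿ' (U+00FF).
print(mojibake("Ø") == "\u251c\u00ff")  # True
```

Reversing the two steps (encode as cp437, decode as UTF-8) recovers the original character, which is consistent with a single wrong decode rather than data loss.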

kidahl avatar Sep 28 '17 12:09 kidahl

Can you try the latest released pandoc? What operating system are you on?

mb21 avatar Sep 28 '17 12:09 mb21

I'm on Windows 10, I will try updating tomorrow.

kidahl avatar Sep 28 '17 13:09 kidahl

I can confirm that this is still an issue with Pandoc 1.19.2.1 on Windows 10.

I see no obvious error in the output directly from Placetable. With the following csv file:

A;B;C
Æ;Ø;Å

I get this:

C:\Projects\eos\workspace\eos\BuildTools>chcp 437
Active code page: 437

C:\Projects\eos\workspace\eos\BuildTools>pandoc-placetable.exe --csv test.csv --delimiter=;
{"blocks":[{"t":"Table","c":[[],[{"t":"AlignDefault"},{"t":"AlignDefault"},{"t":"AlignDefault"}],[0,0,0],[[],[],[]],[[[{"t":"Plain","c":[{"t":"Str","c":"A"}]}],[{"t":"Plain","c":[{"t":"Str","c":"B"}]}],[{"t":"Plain","c":[{"t":"Str","c":"C"}]}]],[[{"t":"Plain","c":[{"t":"Str","c":"Æ"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Ø"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Å"}]}]]]]}],"pandoc-api-version":[1,17,0,5],"meta":{}}

C:\Projects\eos\workspace\eos\BuildTools>chcp 65001
Active code page: 65001

C:\Projects\eos\workspace\eos\BuildTools>pandoc-placetable.exe --csv test.csv --delimiter=;
{"blocks":[{"t":"Table","c":[[],[{"t":"AlignDefault"},{"t":"AlignDefault"},{"t":"AlignDefault"}],[0,0,0],[[],[],[]],[[[{"t":"Plain","c":[{"t":"Str","c":"A"}]}],[{"t":"Plain","c":[{"t":"Str","c":"B"}]}],[{"t":"Plain","c":[{"t":"Str","c":"C"}]}]],[[{"t":"Plain","c":[{"t":"Str","c":"Æ"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Ø"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Å"}]}]]]]}],"pandoc-api-version":[1,17,0,5],"meta":{}}

So it seems to me that Placetable outputs the correct values in UTF-8 when invoked directly. I will therefore post this issue in Pandoc as well, but would appreciate it if you could look into it too.
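What the console displays depends on the active code page, so a more direct check is to redirect the filter output to a file and look at the raw bytes. The following is only a sketch: the function name `classify` and the file name `out.json` are hypothetical, and it distinguishes just two cases (clean UTF-8 versus UTF-8 that was misread as codepage 437 and re-encoded).

```python
def classify(raw: bytes, char: str = "Ø") -> str:
    """Guess how `char` is represented inside the raw bytes."""
    utf8 = char.encode("utf-8")
    # UTF-8 bytes misread as CP437, then written out again as UTF-8.
    double = utf8.decode("cp437").encode("utf-8")
    if double in raw:
        return "double-encoded"
    if utf8 in raw:
        return "utf-8"
    return "not found"


# Hypothetical usage, after running:
#   pandoc-placetable.exe --csv test.csv --delimiter=; > out.json
# with open("out.json", "rb") as f:
#     print(classify(f.read()))
```

If the result is "utf-8" for the file written by the filter alone, the garbling would have to be introduced at a later stage.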

kidahl avatar Sep 29 '17 06:09 kidahl

I suspect this has something to do with the Windows console... (related: https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how/), but I don't know Windows very well (and don't have a system to test).

Maybe try the Windows Subsystem for Linux on Windows 10..?

mb21 avatar Sep 29 '17 09:09 mb21

It may be, but I do not believe the piping mechanism does codepage conversion on any platform. It is, after all, possible to pipe binary data.
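The point about pipes being byte-transparent can be illustrated with a small Python sketch. It pipes UTF-8 bytes through a child process that copies stdin to stdout in binary mode; the helper name `roundtrip_through_pipe` is made up, and the child is a stand-in, not pandoc or pandoc-placetable. It shows only that the pipe itself does not convert anything; a program at either end can still mis-decode the bytes if it reads them as text using the active code page.

```python
import subprocess
import sys


def roundtrip_through_pipe(data: bytes) -> bytes:
    """Send bytes to a child process over a pipe and return what it echoes back."""
    child = [
        sys.executable,
        "-c",
        # The child copies stdin to stdout using the binary buffers only.
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
    ]
    result = subprocess.run(child, input=data, capture_output=True, check=True)
    return result.stdout


data = "Æ;Ø;Å".encode("utf-8")
print(roundtrip_through_pipe(data) == data)  # True
```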

Attached is a simple reproduction of the issue:

chcp 437
pandoc --filter pandoc-placetable -o test437.html test.md
chcp 65001
pandoc --filter pandoc-placetable -o test65001.html test.md

The output HTML is identical in both cases, and both are wrong. The example files (in UTF-8) and the output are attached.

test.zip

kidahl avatar Sep 29 '17 10:09 kidahl

Hm... I tested your input and cannot reproduce the incorrect html output file on macOS...

mb21 avatar Sep 29 '17 11:09 mb21

Hmm, that's not good news. I suspect the issue is Pandoc itself, but my bug report has not had any responses so far.

kidahl avatar Oct 02 '17 07:10 kidahl

Maybe ask around on the pandoc-discuss mailing list... AFAIK few pandoc developers have Windows around to test...

mb21 avatar Oct 02 '17 07:10 mb21