Problem using CSV containing UTF-8 special characters
My CSV is saved in UTF-8 and contains special Norwegian characters (æøå). These end up completely garbled in the generated docx file. The same characters in the .md file itself (the file that includes the .csv table) are rendered fine in the same docx file.
This is reproduced with other accented characters as well, so it seems to be a codepage conversion problem.
Using pandoc version 1.17.0.4 and latest pandoc-placetable.
For the character 'Ø' I get '├ÿ' in the docx file. Those two characters are the bytes 0xC3 0x98 in code page 437 (Windows console), and 0xC3 0x98 is the UTF-8 encoding of 'Ø'.
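The byte arithmetic above can be checked with a few lines of Python. This is only a sketch of the suspected misreading (UTF-8 bytes decoded as code page 437). It does not show where in the pandoc / pandoc-placetable chain that happens, and the helper names are made up for illustration.

```python
# Sketch: UTF-8 bytes of a character misread as code page 437.
# The helper names below are hypothetical, not part of pandoc or pandoc-placetable.

def misread_as_cp437(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were CP437."""
    return text.encode("utf-8").decode("cp437")

def repair_cp437_misread(garbled: str) -> str:
    """Undo the misreading: back to the original bytes, then decode as UTF-8."""
    return garbled.encode("cp437").decode("utf-8")

utf8_bytes = "Ø".encode("utf-8")
garbled = misread_as_cp437("Ø")

print(utf8_bytes.hex())                      # c398
print(garbled == "├ÿ")                       # True
print(repair_cp437_misread(garbled) == "Ø")  # True
```

The round trip back to 'Ø' works because the misreading loses no bytes, which is consistent with a pure decoding mix-up rather than data corruption.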
Can you try the latest released pandoc? What operating system are you on?
I'm on Windows 10. I will try updating tomorrow.
I can confirm that this is still an issue with Pandoc 1.19.2.1 on Windows 10.
I see no obvious error in the output directly from Placetable. With the following csv file:
A;B;C
Æ;Ø;Å
I get this:
C:\Projects\eos\workspace\eos\BuildTools>chcp 437
Active code page: 437
C:\Projects\eos\workspace\eos\BuildTools>pandoc-placetable.exe --csv test.csv --delimiter=;
{"blocks":[{"t":"Table","c":[[],[{"t":"AlignDefault"},{"t":"AlignDefault"},{"t":"AlignDefault"}],[0,0,0],[[],[],[]],[[[{"t":"Plain","c":[{"t":"Str","c":"A"}]}],[{"t":"Plain","c":[{"t":"Str","c":"B"}]}],[{"t":"Plain","c":[{"t":"Str","c":"C"}]}]],[[{"t":"Plain","c":[{"t":"Str","c":"Æ"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Ø"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Å"}]}]]]]}],"pandoc-api-version":[1,17,0,5],"meta":{}}
C:\Projects\eos\workspace\eos\BuildTools>chcp 65001
Active code page: 65001
C:\Projects\eos\workspace\eos\BuildTools>pandoc-placetable.exe --csv test.csv --delimiter=;
{"blocks":[{"t":"Table","c":[[],[{"t":"AlignDefault"},{"t":"AlignDefault"},{"t":"AlignDefault"}],[0,0,0],[[],[],[]],[[[{"t":"Plain","c":[{"t":"Str","c":"A"}]}],[{"t":"Plain","c":[{"t":"Str","c":"B"}]}],[{"t":"Plain","c":[{"t":"Str","c":"C"}]}]],[[{"t":"Plain","c":[{"t":"Str","c":"Æ"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Ø"}]}],[{"t":"Plain","c":[{"t":"Str","c":"Å"}]}]]]]}],"pandoc-api-version":[1,17,0,5],"meta":{}}
So it seems to me that Placetable outputs the correct values in UTF-8 when invoked directly. I will therefore post this issue to Pandoc as well, but would appreciate it if you could also look into it.
I suspect this has something to do with the Windows console... (related: https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how/), but I don't know Windows very well (and don't have a system to test).
Maybe try the Linux Subsystem on Windows 10?
It may be, but I do not believe the piping mechanism does codepage conversion on any platform. It is, after all, possible to pipe binary data.
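The claim that a pipe carries bytes unchanged can be checked with a small Python sketch. It is not how pandoc invokes its filters. It only shows that the pipe itself does not convert anything when both ends read and write raw bytes. If that holds, any conversion would have to happen where a program decodes or encodes text. The helper name and the child one-liner are assumptions made up for this illustration.

```python
import subprocess
import sys

# Child process: copy stdin to stdout as raw bytes, with no text decoding.
CHILD = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"

def pipe_through_child(data: bytes) -> bytes:
    """Send bytes to a child process over a pipe and return what comes back."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout

payload = "Æ;Ø;Å".encode("utf-8")
print(pipe_through_child(payload) == payload)  # True
```

If a program in the chain instead reads or writes through a text layer whose encoding follows the active code page, the bytes can change at that point even though the pipe is transparent.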
Attached is a simple reproduction of the issue:
chcp 437
pandoc --filter pandoc-placetable -o test437.html test.md
chcp 65001
pandoc --filter pandoc-placetable -o test65001.html test.md
The output HTML is identical; both are wrong. The example files (in UTF-8) and the output are attached.
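For anyone who wants to check an output file without opening it in a viewer, here is a rough Python sketch that looks at the raw bytes. It assumes the HTML shows the same garbling as the docx (UTF-8 bytes misread as code page 437 and then written out as UTF-8), which has not been verified for the attached files. The function name is hypothetical.

```python
def classify(data: bytes, ch: str) -> str:
    """Report whether ch appears correctly encoded or CP437-garbled in data."""
    correct = ch.encode("utf-8")
    # What the bytes look like if the UTF-8 form was misread as CP437
    # and the resulting characters were then written out as UTF-8 again.
    garbled = correct.decode("cp437").encode("utf-8")
    if garbled in data:
        return "garbled"
    if correct in data:
        return "ok"
    return "absent"

print(classify("<td>Ø</td>".encode("utf-8"), "Ø"))   # ok
print(classify("<td>├ÿ</td>".encode("utf-8"), "Ø"))  # garbled

# Usage on a real file (file name taken from the commands above):
# with open("test437.html", "rb") as f:
#     print(classify(f.read(), "Ø"))
```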
Hm... I tested your input and cannot reproduce the incorrect html output file on macOS...
Hmm, that's not good news. I suspect the issue is Pandoc itself, but my bug report has not had any responses so far.
Maybe ask around on the pandoc-discuss mailing list... AFAIK few pandoc developers have a Windows machine around to test on...