File name with multi-byte character isn't processed properly

Open croce1 opened this issue 7 years ago • 11 comments

When I run the following commands at the Command Prompt with a file name that includes multi-byte characters, an error occurs and the file isn't processed.

multimarkdown FILENAME
Error reading file 'FILENAME'

multimarkdown "FILENAME"
Error reading file 'FILENAME'

Where FILENAME is MultiMarkdownの使い方.md.

When I change FILENAME to HowToUseMultiMarkdown.md, the file seems to be processed properly.

MultiMarkdown 5 handles the original file name, but MultiMarkdown 6 (6.2.3) does not.

Are file names with multi-byte characters NOT supported in MultiMarkdown 6?

I use Windows 10 Professional (Fall Creators Update), 64-bit.

croce1 avatar Jan 11 '18 15:01 croce1

I'll look at it, but as a workaround, you can use the alternative stdin approach. In *nix:

cat file.md | multimarkdown

I don't know off the top of my head what the comparable Windows command is.

fletcher avatar Jan 11 '18 16:01 fletcher

On Windows,

type file.md | multimarkdown

works! Thank you!

croce1 avatar Jan 11 '18 16:01 croce1

This should be occurring in the same section of code that was previously "fixed" for Windows in order to allow multibyte characters (e.g. ü). Perhaps it chokes on longer (e.g. 3 & 4 byte) characters? (file.c -> scan_file())

I don't have a Windows environment with the debugging tools necessary to fix this properly, so I'll leave it open in case anyone has any thoughts for me.

fletcher avatar Jan 11 '18 21:01 fletcher

(bump) -- Wondering if @f8ttyc8t has any thoughts on this, since they were so helpful with the other Windows issue?

fletcher avatar Jan 26 '18 16:01 fletcher

Sorry for not being responsive... I will check and return with findings.

@croce1 - would you mind sending me a file that causes the trouble? Its content doesn't matter, only its name (ideally, zip it before sending). Because I'm not too familiar with Unicode filenames, I probably need a reference file. And then I have to debug that gem.

Thanks!

f8ttyc8t avatar Jan 26 '18 17:01 f8ttyc8t

@f8ttyc8t -- You weren't being unresponsive at all, just pulling you in to solve my problem for me... ;)

The filename mentioned was MultiMarkdownの使い方.md. It works fine for me on Mac (I just made a new file with that name and put in fake content).

fletcher avatar Jan 26 '18 17:01 fletcher

@fletcher - thanks for the filename, but I think I will need an actual file, just to be sure I'm working on the root cause. (On the other hand, I could create such a file on my Mac and test it on my Windows machine... thanks!)

f8ttyc8t avatar Jan 26 '18 17:01 f8ttyc8t

Unless I am misunderstanding something, the problem is with the filename. Not the contents of the file itself.

fletcher avatar Jan 26 '18 18:01 fletcher

I think I've found the root cause. The application entry point is int main(int argc, char** argv), which means the file name arrives as a char *.

In my tests, the multibyte filename information is lost and replaced by garbage. This causes the arg_parse library (especially, but not only) to fail.

I think that's the main reason Visual Studio offers the TCHAR type and wmain(...) entry points. Supporting that may turn out to be a tricky job.

But possibly I am entirely wrong... I am simply not smart enough to work with MBCS/Unicode/ANSI.

f8ttyc8t avatar Jan 26 '18 20:01 f8ttyc8t
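
To illustrate the entry-point question raised above: on Windows, wmain receives its arguments as UTF-16 (wchar_t), while code built around char * usually expects a single byte encoding such as UTF-8. The following is only a minimal sketch of the conversion a wmain-based port would need, not code from MultiMarkdown. The helper name utf16_to_utf8 is hypothetical, and on Windows the system function WideCharToMultiByte with CP_UTF8 performs the same job.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of MultiMarkdown): convert a NUL-terminated
 * UTF-16 string to a newly allocated UTF-8 string. Returns NULL if memory
 * cannot be allocated or if the input contains an unpaired surrogate.
 * The caller frees the result. */
char *utf16_to_utf8(const uint16_t *src) {
    size_t len = 0;
    while (src[len] != 0) {
        len++;
    }

    /* One UTF-16 code unit never needs more than 3 UTF-8 bytes
     * (a surrogate pair is 2 units and becomes 4 bytes). */
    unsigned char *out = malloc(len * 3 + 1);
    if (out == NULL) {
        return NULL;
    }

    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = src[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate.
             * src[i + 1] is at worst the terminating 0, so this read is safe. */
            uint32_t low = src[i + 1];
            if (low < 0xDC00 || low > 0xDFFF) {
                free(out);
                return NULL;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            free(out);
            return NULL;
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return (char *)out;
}
```

With a helper like this, a wmain could convert each argument once and hand ordinary char * strings to the existing code. Whether that is enough depends on how the file is later opened, which this sketch does not address.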

Hmmmm.....

char * is simply a sequence of bytes, so it is technically capable of storing anything, which is why it is so easily used in so many places. It allows main() and arg_parse() to work properly with multibyte characters on Mac (and presumably *nix). Curious that it doesn't work on Windows. But nothing new there.... ;)

Maybe one day I'll get really bored and try to put together a Windows development environment to be able to dig into this myself, but that seems like such a colossal headache and waste of time.... :(

I will leave this open for future reference. If anyone has a simple change that I should make that fixes this on Windows that does not involve "internally forking" the code into two separate methods, I am happy to include it.

In the meantime, I guess the official answer is to either use ASCII character names, or to use the stdin approach with type file.md | multimarkdown.

Thanks to all!!!

fletcher avatar Jan 26 '18 21:01 fletcher
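
One way the request above (a fix that does not fork the code into two separate methods) is sometimes approached is to keep every path as UTF-8 char * internally and hide the platform difference in a single wrapper around fopen. The sketch below is an assumption-laden illustration, not the project's code: fopen_utf8 is a hypothetical name, and it presumes the path really is UTF-8. With a plain main on Windows the arguments arrive in the active code page instead, so they would also have to be taken from the wide command line (for example via GetCommandLineW and CommandLineToArgvW) and converted first.

```c
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <wchar.h>
#include <windows.h>
#endif

/* Hypothetical wrapper (not part of MultiMarkdown): open a file whose path
 * is given as UTF-8. On Windows the path and mode are converted to UTF-16
 * and passed to _wfopen; elsewhere this is a plain fopen. */
FILE *fopen_utf8(const char *path, const char *mode) {
#ifdef _WIN32
    /* Required buffer sizes in wchar_t units, including the terminator. */
    int plen = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    int mlen = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
    if (plen <= 0 || mlen <= 0) {
        return NULL;
    }

    wchar_t *wpath = malloc(sizeof(wchar_t) * (size_t)plen);
    wchar_t *wmode = malloc(sizeof(wchar_t) * (size_t)mlen);
    FILE *f = NULL;

    if (wpath != NULL && wmode != NULL &&
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, plen) > 0 &&
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen) > 0) {
        f = _wfopen(wpath, wmode);
    }

    free(wpath);
    free(wmode);
    return f;
#else
    /* On Mac and *nix the bytes are passed through unchanged. */
    return fopen(path, mode);
#endif
}
```

The design choice here is that callers never see wchar_t: the only conditional code lives inside the wrapper, and every other function keeps working with char *.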

I'm not familiar with Windows programming, but in the Command Prompt on the Japanese version of Windows, file names generally seem to use Code Page 932 (so-called Shift_JIS) rather than Code Page 65001 (UTF-8).

CP932 (Shift_JIS) is well known for being prone to problems with characters whose encoding includes the byte 0x5C (i.e. backslash).

I don't know which encoding is used for the file name passed to a program (e.g. multimarkdown); I guess it's either Shift_JIS or UTF-8. I also don't know how Windows programmers are supposed to deal with such file names.

So I think the stdin approach that @fletcher suggests is appropriate.

I'll try to attach a ZIP file, mmdtmp.zip, containing MultiMarkdownの使い方.md. The MultiMarkdown file is encoded in UTF-8 with CR LF line endings.

croce1 avatar Jan 27 '18 11:01 croce1
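
The 0x5C remark above can be made concrete. In Shift_JIS / CP932 some double-byte characters have 0x5C as their second byte (for example ソ, encoded as 0x83 0x5C), so code that scans a path byte by byte for a backslash can split the name in the middle of a character. The sketch below contrasts a naive scan with one that skips trail bytes. The function names are hypothetical, and nothing here claims that this is what goes wrong in MultiMarkdown 6.

```c
#include <stddef.h>

/* Lead bytes of double-byte characters in Shift_JIS / CP932. */
static int sjis_is_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive scan: index of the last 0x5C byte, or -1 if there is none.
 * This may land on the trail byte of a double-byte character. */
long last_backslash_naive(const char *path) {
    long found = -1;
    for (long i = 0; path[i] != '\0'; i++) {
        if (path[i] == '\\') {
            found = i;
        }
    }
    return found;
}

/* Shift_JIS-aware scan: the byte after a lead byte is skipped, so only
 * real path separators are reported. */
long last_backslash_sjis(const char *path) {
    long found = -1;
    for (long i = 0; path[i] != '\0'; i++) {
        unsigned char c = (unsigned char)path[i];
        if (sjis_is_lead_byte(c) && path[i + 1] != '\0') {
            i++; /* skip the trail byte, even if it is 0x5C */
        } else if (c == '\\') {
            found = i;
        }
    }
    return found;
}
```

For the CP932 bytes of C:\ソ.md, the naive scan reports the trail byte of ソ at index 4 as the last separator, while the aware scan reports the real separator at index 2.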