When splitting a non-UTF-8 file, non-ASCII characters become garbage

Open rimas-kudelis opened this issue 8 years ago • 0 comments

I just tried to split a CSV file saved in the Windows-1257 character set, and the data gets distorted on saving: every non-ASCII character in the output file is turned into the following UTF-8 character: �. When I load the resulting file as Windows-1257, I see the following three bytes in place of every non-ASCII character: ļæ½. My guess is that the original file was read as UTF-8 and the resulting files were saved as UTF-8. Since the original wasn't UTF-8 in the first place, every "invalid" byte was turned into a U+FFFD REPLACEMENT CHARACTER upon reading, which kills any hope that the resulting files could be saved correctly. As far as I know, Excel still doesn't allow saving UTF-8 CSV files, so not supporting 8-bit character sets looks like a serious limitation.
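To illustrate the guess above, here is a minimal Python sketch of what I think happens. It is not LargeFileSplitter's actual code, which I haven't read; the sample word and all variable names are made up for the demonstration, and the tool's real read/write path is an assumption.

```python
# Hypothetical reproduction of the suspected behaviour (not the tool's real code).

# A Lithuanian word as it would be stored in a Windows-1257 file.
original_bytes = "Žąsis".encode("cp1257")

# Suspected step 1: the file is read as UTF-8. The non-ASCII bytes are not
# valid UTF-8, so a lenient decoder substitutes U+FFFD for each of them.
decoded = original_bytes.decode("utf-8", errors="replace")

# Suspected step 2: the text is written back out as UTF-8.
# Each U+FFFD becomes the three bytes EF BF BD.
resaved_bytes = decoded.encode("utf-8")

# Opening that output as Windows-1257 shows three characters per lost letter.
viewed = resaved_bytes.decode("cp1257")

print(repr(decoded))  # the two non-ASCII letters are now U+FFFD
print(viewed)         # the mojibake described in this report

# For comparison: decoding and encoding with the file's real character set
# round-trips without loss.
round_trip = original_bytes.decode("cp1257").encode("cp1257")
```

Once the bytes have been replaced with U+FFFD, the original letters cannot be recovered, so the fix would have to happen at read time: either let the user pick the encoding, or split the file on raw bytes without decoding it at all.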

rimas-kudelis, Mar 06 '16 10:03