LargeFileSplitter
When splitting a non-UTF-8 file, non-ASCII characters become garbage
I just tried to split a CSV file saved in the Windows-1257 character set, and the data was corrupted on saving: each non-ASCII character in the output file was turned into the following UTF-8 character: �. When I load the resulting file as Windows-1257, I see the following three characters in place of every non-ASCII character: ļæ½ (the bytes EF BF BD, which is the UTF-8 encoding of U+FFFD). My guess is that the original file was read as UTF-8 and the resulting files were saved as UTF-8, but since the original wasn't UTF-8 in the first place, every "invalid" byte was turned into a U+FFFD REPLACEMENT CHARACTER upon reading, which kills any hope that the resulting files could be saved correctly. As far as I know, Excel still doesn't allow saving UTF-8 CSV files, so not supporting 8-bit character sets looks like a serious limitation.
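To illustrate the mechanism I suspect, here is a minimal Python sketch. It is not LargeFileSplitter's actual code: the sample text and the helper names `lossy_utf8_roundtrip` and `roundtrip_with_encoding` are made up for this report. It only shows that decoding Windows-1257 bytes as UTF-8 with replacement produces exactly the symptoms described above, and that using the file's real encoding for both reading and writing keeps the bytes intact.

```python
# Hypothetical reproduction of the suspected behaviour; not the tool's real code.

ORIGINAL_TEXT = "Rīga"                 # sample value with one non-ASCII character
raw = ORIGINAL_TEXT.encode("cp1257")   # bytes as stored in a Windows-1257 CSV


def lossy_utf8_roundtrip(data: bytes) -> bytes:
    """Read bytes as UTF-8, replacing invalid sequences, then write UTF-8."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def roundtrip_with_encoding(data: bytes, encoding: str) -> bytes:
    """Read and write with the same, explicitly chosen encoding."""
    return data.decode(encoding).encode(encoding)


damaged = lossy_utf8_roundtrip(raw)
print(damaged.decode("utf-8"))    # what a UTF-8 viewer shows
print(damaged.decode("cp1257"))   # what a Windows-1257 viewer shows

preserved = roundtrip_with_encoding(raw, "cp1257")
print(preserved == raw)           # whether the original bytes survived
```

If this is indeed what happens, then either letting the user choose the encoding or splitting the file at the byte level (on line-break bytes, without decoding at all) would avoid the data loss for single-byte character sets such as Windows-1257.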