SharpZipLib icon indicating copy to clipboard operation
SharpZipLib copied to clipboard

Tar methods wrongly use the term "ASCII" for non-encoded string bytes

Open brcaswell opened this issue 4 years ago • 3 comments

https://github.com/icsharpcode/SharpZipLib/blob/fec479c2e1a2c7cd58ba8319450901fe40eb070f/src/ICSharpCode.SharpZipLib/Tar/TarHeader.cs#L1088

Steps to reproduce

  1. see https://dotnetfiddle.net/lATEXU
public static void Main()
{
	var s1_UTF_16 = "ABC网络D";
	var b1_UTF_16 = Encoding.Unicode.GetBytes(s1_UTF_16);
	var buffer_UTF_16_default = new byte[b1_UTF_16.Length];
	var buffer_UTF_16_ASCII = new byte[b1_UTF_16.Length];
	var buffer_UTF_16_UTF8 = new byte[b1_UTF_16.Length];
	var buffer_UTF_16 = new byte[b1_UTF_16.Length];
		
	GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16_default, 0, buffer_UTF_16_default.Length, null);
	GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16_ASCII, 0, buffer_UTF_16_ASCII.Length, Encoding.ASCII);
	GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16_UTF8, 0, buffer_UTF_16_UTF8.Length, Encoding.UTF8);
	GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16, 0, buffer_UTF_16.Length, Encoding.Unicode);
		
	var defaultBufferedString = Encoding.ASCII.GetString(buffer_UTF_16_default);
	var asciiBufferedString = Encoding.ASCII.GetString(buffer_UTF_16_ASCII);
	var utf8BufferedString = Encoding.UTF8.GetString(buffer_UTF_16_UTF8);
	var utf16BufferedString = Encoding.Unicode.GetString(buffer_UTF_16);
		
	WriteToConsole(nameof(defaultBufferedString), defaultBufferedString);
	WriteToConsole(nameof(asciiBufferedString), asciiBufferedString);
	WriteToConsole(nameof(utf8BufferedString), utf8BufferedString);
	WriteToConsole(nameof(utf16BufferedString), utf16BufferedString);
		
	var b_asciiConvertResult = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, b1_UTF_16);
	var s_asciiConvertResult = Encoding.ASCII.GetString(b_asciiConvertResult);
		
	WriteToConsole(nameof(s_asciiConvertResult), s_asciiConvertResult);
		
	void WriteToConsole<T>(string nameOf, T value) => Console.WriteLine($"{nameOf} => \"{value}\"");
}

Expected behavior

Calling GetAsciiBytes and passing parameter encoder null should behave the same as passing Encoding.ASCII

Actual behavior

Calling GetAsciiBytes and passing parameter encoder null converts some non-ASCII characters into ASCII characters without encoding

Version of SharpZipLib

Obtained from (only keep the relevant lines)

  • Copied method from GitHub

Also, you may want to reconsider the name of this helper method

brcaswell avatar Jan 22 '21 19:01 brcaswell

Why was this expected? If you are referring to the comment for the encoding argument, it should perhaps be updated to better reflect that supplying no encoding means "no encoding conversion / just use the lower 8 bits".

This is because of backwards compatibility, as this was what it did previously.

piksel avatar Jan 22 '21 19:01 piksel

right, the merit to the expectation is the references to 'ASCII only' and 'non-ASCII' throughout the codebase (including the obsolete attribute message on several constructors of types it seems), as well as parameter description of encoder here (and perhaps the method name).

I'm not sure whether the backward compatibility to this lower 8 bit copy behavior is something that should be preserved (that's a product support concept); But, yes, I do think a task to clarify what null or default value of encoder actually does is warranted.


note that this concept of default encoding actually became a point of confusion in a recent SO question and answer (ref https://stackoverflow.com/q/65847723/1366179 )

brcaswell avatar Jan 22 '21 20:01 brcaswell

What "non-ASCII" and "ASCII only" is trying to convey, is that characters that are outside the "ASCII range" is not guaranteed to be preserved, and that you should make a conscious of how to handle those bytes (by specifying an Encoding).

Now, the names of the methods were both wrong before, when there were no attempts of encoding, and they are now. But since they were only used internally inside the header parser for Tar archives, I didn't figure renaming them was critical, as the PR was already kind of big.

piksel avatar Jan 22 '21 21:01 piksel