Tar methods wrongly use the term "ASCII" for non-encoded string bytes
https://github.com/icsharpcode/SharpZipLib/blob/fec479c2e1a2c7cd58ba8319450901fe40eb070f/src/ICSharpCode.SharpZipLib/Tar/TarHeader.cs#L1088
Steps to reproduce
- see https://dotnetfiddle.net/lATEXU
public static void Main()
{
var s1_UTF_16 = "ABC网络D";
var b1_UTF_16 = Encoding.Unicode.GetBytes(s1_UTF_16);
var buffer_UTF_16_default = new byte[b1_UTF_16.Length];
var buffer_UTF_16_ASCII = new byte[b1_UTF_16.Length];
var buffer_UTF_16_UTF8 = new byte[b1_UTF_16.Length];
var buffer_UTF_16 = new byte[b1_UTF_16.Length];
GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16_default, 0, buffer_UTF_16_default.Length, null);
GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16_ASCII, 0, buffer_UTF_16_ASCII.Length, Encoding.ASCII);
GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16_UTF8, 0, buffer_UTF_16_UTF8.Length, Encoding.UTF8);
GetAsciiBytes(s1_UTF_16, 0, buffer_UTF_16, 0, buffer_UTF_16.Length, Encoding.Unicode);
var defaultBufferedString = Encoding.ASCII.GetString(buffer_UTF_16_default);
var asciiBufferedString = Encoding.ASCII.GetString(buffer_UTF_16_ASCII);
var utf8BufferedString = Encoding.UTF8.GetString(buffer_UTF_16_UTF8);
var utf16BufferedString = Encoding.Unicode.GetString(buffer_UTF_16);
WriteToConsole(nameof(defaultBufferedString), defaultBufferedString);
WriteToConsole(nameof(asciiBufferedString), asciiBufferedString);
WriteToConsole(nameof(utf8BufferedString), utf8BufferedString);
WriteToConsole(nameof(utf16BufferedString), utf16BufferedString);
var b_asciiConvertResult = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, b1_UTF_16);
var s_asciiConvertResult = Encoding.ASCII.GetString(b_asciiConvertResult);
WriteToConsole(nameof(s_asciiConvertResult), s_asciiConvertResult);
void WriteToConsole<T>(string nameOf, T value) => Console.WriteLine($"{nameOf} => \"{value}\"");
}
Expected behavior
Calling GetAsciiBytes and passing parameter encoder null should behave the same as passing Encoding.ASCII
Actual behavior
Calling GetAsciiBytes and passing parameter encoder null converts some non-ASCII characters into ASCII characters without encoding
Version of SharpZipLib
Obtained from (only keep the relevant lines)
- Copied method from GitHub
Also, you may want to reconsider the name of this helper method
Why was this expected? If you are referring to the comment for the encoding argument, it should perhaps be updated to better reflect that supplying no encoding means "no encoding conversion / just use the lower 8 bits".
This is because of backwards compatibility, as this was what it did previously.
right, the merit to the expectation is the references to 'ASCII only' and 'non-ASCII' throughout the codebase (including the obsolete attribute message on several constructors of types it seems), as well as parameter description of encoder here (and perhaps the method name).
I'm not sure whether the backward compatibility to this lower 8 bit copy behavior is something that should be preserved (that's a product support concept); But, yes, I do think a task to clarify what null or default value of encoder actually does is warranted.
note that this concept of default encoding actually became a point of confusion in a recent SO question and answer (ref https://stackoverflow.com/q/65847723/1366179 )
What "non-ASCII" and "ASCII only" is trying to convey, is that characters that are outside the "ASCII range" is not guaranteed to be preserved, and that you should make a conscious of how to handle those bytes (by specifying an Encoding).
Now, the names of the methods were both wrong before, when there were no attempts of encoding, and they are now. But since they were only used internally inside the header parser for Tar archives, I didn't figure renaming them was critical, as the PR was already kind of big.