runtime TarWriter uses ASCII to write down fields and should use UTF8 instead

All formats use ASCII encoding to write down fields, which is unfortunate because a UTF8 name like "földër" will look garbled when read back.

MemoryStream ms = new MemoryStream();
TarWriter writer = new(ms, leaveOpen: true);
            
GnuTarEntry gnuEntry = new(TarEntryType.Directory, "földër");
writer.WriteEntry(gnuEntry);

writer.Dispose();

ms.Position = 0;
TarReader reader = new(ms);
TarEntry readEntry = reader.GetNextEntry();
Console.WriteLine(readEntry.Name); // Prints "f?ld?r".
reader.Dispose();

This is visually mitigated on Pax because UTF8 encoding is used to write down extended attributes and fortunately, that's the default format. However, legacy fields on Pax entries do get garbled but when using .NET APIs, we overwrite the legacy fields with the contents of the extended attributes. So AFAIK, the issue in pax shows only if you look at the bytes of the tar archive:

cc @carlossanlop @stephentoub @danmoseley @tmds

Sep 12 '22 20:09 jozkee

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Issue Details

All formats use ASCII encoding to write down fields, which is unfortunate because a UTF8 name like "földër" will look garbled when read back.

MemoryStream ms = new MemoryStream();
TarWriter writer = new(ms, leaveOpen: true);
            
GnuTarEntry gnuEntry = new(TarEntryType.Directory, "földër");
writer.WriteEntry(gnuEntry);

writer.Dispose();

ms.Position = 0;
TarReader reader = new(ms);
TarEntry readEntry = reader.GetNextEntry();
Console.WriteLine(readEntry.Name); // Prints "f?ld?r".
reader.Dispose();

This is visually mitigated on Pax because UTF8 encoding is used to write down extended attributes and fortunately, that's the default format. However, legacy fields on Pax entries do get garbled but when using .NET APIs, we overwrite the legacy fields with the contents of the extended attributes. So AFAIK, the issue in pax shows only if you look at the bytes of the tar archive:

cc @carlossanlop @stephentoub @danmoseley @tmds

Author:	Jozkee
Assignees:	-
Labels:	`area-System.IO`
Milestone:	8.0.0

Sep 12 '22 20:09 ghost

Should we fix for 7.0? What is the risk of breaking something else if we fix this?

Sep 12 '22 21:09 danmoseley

https://github.com/dotnet/runtime/pull/75373#discussion_r967650446

What is the impact if we don't do this? Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter?

There's a high liklihood file names will contain non-ASCII characters.

What do other tools do? Why did we start with ASCII?

Sep 12 '22 21:09 stephentoub

Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter?

I tried opening with 7Zip an archive created with TarFile.CreateFromDirectory and that worked just fine.

What do other tools do?

I tried BDS Tar on WSL Ubuntu with all the supported formats and all of them encode the name as UTF8.

What is the impact if we don't do this?

For formats other than Pax, containing a name or some other field (such as LinkName) that contains non-ASCII characters, those characters will be garbled as shown in the picture of the original post. Keep in mind that Pax is the default.

Sep 12 '22 21:09 jozkee

For formats other than Pax, containing a name or some other field (such as LinkName) that contains non-ASCII characters, those characters will be garbled as shown in the picture of the original post.

But they'll be garbled not just if someone inspects the value in a debugger but also if any other non-.NET tool tries to untar them, right? e.g. they'll produce garbage names in the file system?

Keep in mind that Pax is the default.

Sure... and we make it easy to do something other than the default.

Sep 12 '22 21:09 stephentoub

Is there any impact on reading archives using our API, created by other Tar tools, and using non-ASCII metadata?

Sep 12 '22 21:09 danmoseley

What is the risk of breaking something else if we fix this?

This might be as easy as replacing Encoding.ASCII for Encoding.UTF8 and nothig else.

Sep 12 '22 21:09 jozkee

they'll produce garbage names in the file system?

Right, even our own TarReader will read them back as garbled as I show in the snippet in the original post.

Sep 12 '22 21:09 jozkee

they'll produce garbage names in the file system?

Right

Then this needs to be fixed for 7.0.

This might be as easy as replacing Encoding.ASCII for Encoding.UTF8 and nothig else.

It won't be. Several code paths assume that the input and output lengths will be the same; that's a valid assumption for Encoding.ASCII but not for Encoding.UTF8.

Sep 12 '22 21:09 stephentoub

You'll want to include chars that encode as more than 2 bytes, too, not just földër (föld€r ? )

We should have tests for every string in our API both reading and writing that use non ASCII chars -- I guess that includes entry name and link name, and extended attributes, possibly user/group although they may be limited by POSIX to a subset of ASCII. We'll need to do interop tests for each value in each format.

Sep 12 '22 21:09 danmoseley