libs-team
libs-team copied to clipboard
`impl Display for CStr`
Proposal
Problem statement
In multi-language projects that need to perform interop with C or C++, you will often need to manipulate most of your strings using the core::ffi::CStr type rather than the usual str type, because you need the string to be nul-terminated before you can pass it into the C/C++ codebase. This means that working with nul-terminated strings should be somewhat convenient.
Unfortunately, it's currently very difficult to print a nul-terminated string. The type does not implement Display and there's no display() function either. Currently, you have to do something like this to print it:
println!("{}", my_string.to_string_lossy());
Or use an extension trait for CStr.
What about Debug?
It doesn't print the right thing.
println!("{:?}", c"Hello, here is a string: \n\"\'\xff.");
"Hello, here is a string: \n\"\'\xff."
It wraps the string in quotes, and also escapes various characters such as quotes and newlines.
Motivating examples or use cases
This is motivated by the Linux Kernel, where we currently have a custom CStr type that implements Display. We would like to move away from the custom CStr so that we can use c"" literals. However, there are many places where we print a nul-terminated string for various reasons.
dev_debug!("Registered {name}.");
This becomes:
use kernel::prelude::*; // for CStrExt
dev_debug!("Registered {}.", name.display());
Ideally we would like to be able to continue printing strings with the original syntax.
We don't really care about utf-8 here, and the thing we are printing to doesn't either. Printing replacement characters like how String::from_utf8_lossy would be fine.
Solution sketch
The solution sketch is to implement Display for CStr and CString. The implementation would be the same as for the ByteStr type.
Alternatives
The primary question to consider here is how to deal with bytes that are not valid utf-8.
Add a .display() function
The Path type has a similar problem, but the standard library has chosen to handle it by adding a .display() method. The Path type does not implement the Display trait directly.
I think that we should not repeat this solution for CStr. The Path type carries with it the intent that you are being careful with your paths and that you want to avoid accidentally breaking a path by round-tripping it through the String type. The CStr type does not carry the same intent with it, and implementing Display is more convenient because it lets you print with the "{name}" syntax instead of having to do "{}", name.display().
How to escape the string
This proposal says that the default way to print a CStr should be to use replacement characters rather than some other form of escaping. This is because cstr printing is generally used to display the string to a user, and � is much better at conveying that the data is invalid than \xff to a non-technical user.
Links and related work
Zulip thread: zulip
Adding OsStr::display: tracking issue
LKML thread: lkml
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
The new ByteStr type currently implements Display in the way you suggest here: https://doc.rust-lang.org/nightly/std/bstr/struct.ByteStr.html#trait-implementations-1
I'm unsure about the alternate Display impl here.
Is there a semantic difference between ByteStr and CStr? I think the Display impl for ByteStr is fine because the type is explicitly documented to be conventionally UTF-8. So there's an expectation that the data is "string-like," and more specifically, UTF-8. That is in contrast to &[u8] which carries no such connotation. And indeed, the Display impl on ByteStr (along with its Debug impl) is part of the point of opting into using a type like ByteStr.
Does a CStr have a similar broadly applicable connotation? I think so, although I'm not as steeped in C idioms. And I'm unsure about the UTF-8 expectation.
Oh I had not heard of ByteStr. Yes, I think ByteStr has exactly the same semantics as CStr. In fact, perhaps CStr should Deref to ByteStr?
I updated the solution sketch to say that this shares the impl with ByteStr.
I kinda love the idea of a Deref impl here.
And yes, I think the Display impls should match, other than omitting the trailing NUL in the case of CStr/CString.
CStr -> ByteStr would be an expensive operation (strlen) if the plan of making &CStr a thin pointer without length metadata is ever carried out. It’s not the first CStr API that would from O(1) to O(n) if length is no longer stored, but since Deref is often invoked implicitly, it would be by far the hardest to consciously avoid.
Personally I think the probability of “thin pointer CStr” change ever happening is already low and getting lower every year but a Deref impl feels like a novel nail in the coffin.
Does a
CStrhave a similar broadly applicable connotation? I think so, although I'm not as steeped in C idioms. And I'm unsure about the UTF-8 expectation.
C's char has an implementation-defined encoding and can be controlled with -fexec-charset and also pragmas, it looks like Clang and GCC default to UTF-8 for literals, but on Windows uses is the default code page. char8_t is the newer type intended to signify UTF-8, corresponding to string literals with the u8"..." prefix.
It's likely best to say that our CStr is just a bag of nonzero bytes terminated by a null, without any conventional encoding. But we already provide API to make life easier when the bytes happen to be UTF-8 or UTF-8-like with conversions to and from &str / String, and a Display implementation seems like it would simply be an additional case of this. At least, I don't think that having a Display implementation that assumes something similar to UTF-8 punishes other uses of CStr that would need an external conversion routine anyway.
https://internals.rust-lang.org/t/pre-rfc-deprecate-and-replace-cstr-cstring/5016/38 has some more details about C encodings (that entire thread has some good comments).
Also related, regarding Display round trips https://github.com/rust-lang/rust/pull/136687.
Related PR: https://github.com/rust-lang/rust/pull/138498
Given the concerns about &CStr becoming a thin pointer, I think my PR is a no-go. Would the next logical step be to open a PR implementing Display for CStr behind a feature gate?
Trait impls on stable types are insta-stable themselves, so that would need to go through FCP.
But the next step here is for us to sign off on this ACP. Then PR. Then that has to go through FCP because it's insta stable.
A Display impl for CStr seems reasonable to me. I'll accept this. Once a tracking issue is created, we can close this.
One concern for stabilization is whether this should block on ByteStr (and its Display impl) becoming stable. Particularly if we want those Display impls to be consistent with one another.
@BurntSushi thanks. Are more sign offs required?
The feature lifecycle doc describes tracking issues as tracking feature stabilization, but since this would be insta-stable, I presume the tracking issue would be used to hold the FCP?
If my understanding is correct, the next steps are to create a PR and a tracking issue, then start the FCP on the tracking issue. Once the FCP is concluded, the PR can be merged. Is that correct?
Yes one is enough.
I think just a tracking issue is enough for now? I don't think a PR would hurt anything though.
I can never keep track of whether we do FCP on the tracking issue or the PR. I think in practice we do whatever is convenient or wherever most of the discussion coalesces.
We discussed this in the @rust-lang/libs-api meeting and couldn't come to a consensus on adding a Display impl to CStr, but we did agree to add a display method just like Path and OsStr. The main reason for rejecting a Display impl is that the conversion is lossy, which is not the case for all other string types that implement Display. This was the original reasoning for adding a display method instead of implementing the Display trait for Path and OsStr, and this also applies to CStr.
While not stable, impl Display for ByteStr is also lossy. Does that mean ByteStr will need to move to a display method?
cc @joshtriplett
ByteStr was discussed in the meeting, its purpose is to explicitly opt into treating maybe-utf8 as utf8-or-replacement output, so ByteStr::new(value) similar to calling value.display().
🤔... perhaps we could generalize ByteStr to wrap more types than [u8]?
Some more libs-api discussion happened: We may want to reassess the "thin pointer Cstr" situation (in a separate topic) and commit to it being a fat pointer, then if the outcome for that is affirmative then we don't need a display() method and can use a O(1) conversion to ByteStr instead as the way to express the "probably utf8" assumption. The display() methods predate the ByteStr proposal, so we don't necessarily have to repeat adding those.
I agree with adding a .display() method here.
@Darksonn I know you argued against a .display() method specifically, but I do agree with other libs-api members. Specifically, the lossiness can be quite troublesome. In particular, because of the ToString blanket impl, if CStr implements Display, then you have cstring.to_string() as something that works infallibly, but provides a lossy result. And this seems non-obvious to me.
Would a .display() method let you remove your custom CStr type? (Or rather, would it remove one of your blockers from using the custom CStr type?) If not, it may be the case that a custom CStr type is appropriate for your use case, where an assumption of UTF-8 lossiness is explicitly okay.
No. The main reason we still have a custom CStr type today is so that we can write println!("{my_string}") when we have a c string. After all, we could add a .display() method to core::ffi::CStr ourselves by adding an extension trait to our custom prelude.
In particular, because of the
ToStringblanket impl, ifCStrimplementsDisplay, then you havecstring.to_string()as something that works infallibly, but provides a lossy result. And this seems non-obvious to me.
Note that it was suggested earlier in this thread to make CStr deref to ByteStr, but this logic also implies that we can't do that. Otherwise cstring.to_string() would go through deref to call ByteStr::to_string.
I've been following a discussion at my company of whether or not a Rust interop type for C++ std::string should implement Display. The consensus seems to be that it should not implement Display because it presents a footgun, particularly because of the ToString blanket impl. While it can be said that std::string is UTF8 more often than not, there are many cases in our codebase of it being used to store binary data.
That last part is "our fault", but the interaction of Display with ToString is what makes it so damaging, and it is a specter in this discussion too. So I'll be adding something about ToString to my edition wish list.
EDIT: Moved to https://github.com/rust-lang/rust/issues/134915#issuecomment-2899431966
@tmandry The entire point of ByteStr is to be "conventionally UTF-8". The type exists to serve that role. It is already widely used in the ecosystem as "bstr::BStr", including the behavior of Display and Debug. (Also, the right place for ByteStr discussion is on its own tracking issue.)
I agree that to_string is not a great name. That should never have been based on Display. But that's a universal problem, and would also apply if someone called to_string on some other type that didn't convert losslessly to a string.