formats der: string support deficiencies

Taking this out from #690 , #691 and few practical thoughts from my side:

The following string tags are unsupported:

[ ] BMPString
[ ] UniversalString
[ ] VisibleString
[ ] GraphicString

Additionally:

[ ] TeletexString lacks support for chars >= 0x80

Additionally current API lacks easy to use unified string API (at least for AnyRef). Currently when parsing the string from AnyRef (e.g. in RDN) one has to manually match all the string tags and to use corresponding foo_string() calls. Instead it might be useful to add the is_string() -> bool and to support generic string wrapper around ::printable_string(), ::utf8_string(), ::ia5_string(), etc.

Jul 30 '22 06:07 lumag

support generic string wrapper around ::printable_string(), ::utf8_string(), ::ia5_string(), etc.

Yes, I mentioned as much in #692. I definitely want to address this before cutting another release of der as there is currently massive amount of duplication and copy-pasted code among the various string types.

There's an internal StrSlice type it might make sense to expose publicly as a common representation of a string. It's a newtype wrapper for &str with an associated [Length].

With this as a common internal representation, a lot of the duplication between e.g. Ia5StringRef, TeletexStringRef, and VideotexStringRef could be eliminated, possibly via Deref coercions to replace the existing (same-shaped) inherent methods with a common undelrying type, with the core differences between the various types being largely relegated to the constructor.

It would be good to figure all of this out before trying to add any additional string types.

Jul 30 '22 13:07 tarcieri

@tarcieri the major issue with the StrSlice that I faced while trying to implement TeletexString in a more correct way (handling byte >= 0x80 as if it was the win-1252), is that it can not be easily updated to support different as_bytes() implementation (or I could not come up with the way to achieve this). So all the strings currently work because either they are ASCII strings or Utf8Strings. For BMPString we'd need UTF-16BE encoding.

Jul 30 '22 15:07 lumag

@lumag yeah, StrSlice could only work for cases that are subsets of ASCII/UTF-8, but I think it would still make sense to use it for those.

Notably I think Ia5StringRef could be changed to Deref to a StrSlice (replacing the existing inherent methods) without it being a breaking change, particularly as StrSlice isn't currently part of the public API so we can make changes there if necessary.

Jul 30 '22 15:07 tarcieri

Being a minor party here, I'm fine with either approach as long as StrSlice doesn't become the public API.

Jul 30 '22 16:07 lumag

If it were the Deref target for ASCII-like encodings it would have to be

Jul 30 '22 16:07 tarcieri

Can it be replaced with the trait to provide the calling interface?

Jul 30 '22 19:07 lumag

A trait would be an alternative option. You'd still need to retain the inherent methods on Ia5StringRef for now in order to avoid a breaking change, however.

Jul 30 '22 19:07 tarcieri

As a second thought, let's maybe rename StrSlice to AbstractString and use Deref as you suggested? In the end, it might be possible to extend it later to suppor custom byte representations.

Jul 30 '22 20:07 lumag

This crate generally uses the *Ref naming convention for references, leaving *String for the owned types, ala std::string::String.

StrSlice is used all over the place in an AsRef<str> capacity, so it's not going to be extensible to something which doesn't have a straightforward infallible conversion to &str.

Jul 30 '22 20:07 tarcieri

AsRef<str> would stay as is. It's the AsRef<[u8]>::as_ref(&self), who would require it to be changed.

Jul 30 '22 20:07 lumag

... if different byte encodings are to be supported.

Jul 30 '22 20:07 lumag

I really think the types which represent encodings which aren't subsets of ASCII/UTF-8 should be distinct from each other.

The CHOICE types that operate over them can map to enums, and you can have a method which returns a Cow<'a, str> which transcodes if necessary.

Jul 30 '22 20:07 tarcieri

@tarcieri I have to admit, that this is out of my level of Rust knowledge.

Jul 30 '22 20:07 lumag

Cow abstracts over the owned and borrowed forms, so the encodings which are subsets of ASCII/UTF-8 can be represented using a simple reference conversion, and UTF-16 transcoded.

Separately I'm curious if applying RFC6818 deprecations could reduce the total number of string types that need to be implemented to support X.509.

Jul 30 '22 21:07 tarcieri

There is a usual problem with deprecations. It is typical to have a set of data generated with the previous standards in mind, e.g. older digitally signed and/or encrypted documents. So, I've always assumed that if the data/type/tag was allowed to be present at some point in the past, the software should still be able to accept it.

Jul 30 '22 22:07 lumag

I opened a proposal for generic string support in #1061.

For BMPString to work with a type which is String internally we'd have to transcode UCS-2(?) to UTF-8 when decoding and transcode back to UCS-2 when encoding.

Or we could store the raw UCS-2, but that wouldn't be able to leverage a generic implementation as easily. I think the transcoding penalty is worth the convenience of being able to have a common generic string type.

May 16 '23 20:05 tarcieri

Historically I think it was impossible, because all Strings were referenced. Now with the addition of owned structures, it should be possible to transcode UCS-2 <-> UTF-8. In the worst case, we might end up with transcoding Owned implementation and raw-UCS-2 Ref implementation.

May 16 '23 23:05 lumag

I've added BmpString which is the most pressing.

Let's track the others in #1061.

Aug 18 '24 17:08 tarcieri

formats formats copied to clipboard

der: string support deficiencies

formats
formats copied to clipboard