rust-unic Add allocation free API

I'd like to see a generic "transformation" or "streaming" API that can be used to transform data across the subcrates, possibly without requiring a heap allocation if some mutable storage is already allocated (eg. that can be used for IDNA transformations as well as for normalization, giving the two packages a consistent API). I've recently been playing with adapting the Go text/transform APIs, but as you can imagine it doesn't exactly map cleanly to Rust, being a drastically different language (Initial trait and experimentation can be found here).

Instead, I suspect having some sort of io::Read/io::Write impl would work better. This would allow us to transform into pre-allocated space, as opposed to the current implementation (in IDNA at least) that returns a String which must always be heap allocated when IDNA is called. It might even let us wrap several transforms up into a single object without doing multiple allocations, which would be useful when implementing PRECIS (something I've also been experimenting with lately).
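To make the io::Write idea concrete, here is a minimal sketch. The names `AsciiLowerWriter` and `lower_into` are made up for illustration, and ASCII lowercasing is only a toy stand-in for a real transform such as IDNA mapping. It relies on the standard library's `Write` impl for `&mut [u8]`, so the output lands in caller-provided storage:

```rust
use std::io::{self, Write};

/// Hypothetical adapter: lowercases ASCII bytes and forwards them to `inner`.
/// A toy stand-in for a real transform; not part of any existing crate.
struct AsciiLowerWriter<W: Write> {
    inner: W,
}

impl<W: Write> Write for AsciiLowerWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Work through a small stack buffer so no heap allocation is needed.
        let mut chunk = [0u8; 64];
        let n = buf.len().min(chunk.len());
        chunk[..n].copy_from_slice(&buf[..n]);
        chunk[..n].make_ascii_lowercase();
        // Report only the bytes the inner writer actually accepted.
        self.inner.write(&chunk[..n])
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Lowercases `src` into the caller-provided `dst` and returns the number of
/// bytes written. Fails if `dst` is too small to hold the output.
fn lower_into(src: &[u8], dst: &mut [u8]) -> io::Result<usize> {
    let cap = dst.len();
    // `&mut [u8]` implements `Write` and shrinks as bytes are written to it.
    let mut w = AsciiLowerWriter { inner: dst };
    w.write_all(src)?;
    Ok(cap - w.inner.len())
}
```

Because the adapter is itself a `Write`, several such adapters could in principle be nested around one destination buffer, which is roughly the "wrap several transforms into a single object" case mentioned above.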

It would be nice to start discussing such an API in this issue if you think such a thing would be desirable. Thanks for your work on this!

Jun 21 '17 18:06 SamWhited

This is very interesting, @SamWhited! I haven't worked with transforms myself, so I'm not sure what the common practices are. Can you provide a snippet of how such an API would be used?

PS. At the moment, I'm focused on the UCD component, making sure the Character Property API is consistent between components. I think it would take me a bit longer to get to the string-level APIs.

Jun 27 '17 07:06 behnam

I haven't worked with transforms myself, so not sure what are the common practices.

I can't claim to be a Unicode API expert; I'm not even sure if this is a common API design for this sort of thing, but having some form of allocation free API definitely seems desirable since many of these transformations are going to be extremely common, eg. happening on every network request when implementing protocols that support internationalization.

Can you provide a snippet of how such an API would be used?

One example might be my xmpp-addr crate. XMPP addresses ("Jabber IDs", or "JIDs" for historical reasons) support internationalization so that your nickname or address can be your name. E.g., in XMPP pierre.bé[email protected] is a valid address, similar to an email address except with the ability to use your name even if it contains non-ASCII characters (although I think most email providers ignore the ASCII requirement and must do some form of canonicalization too).

When comparing to see who in our address book sent a message (in an instant messaging application, for example), we need to do some normalization to make sure that different representations of "é" map to the same canonical JID. We can do that right now, of course, but this is a basic low-level primitive of the protocol: every single stanza (packet) coming in over the wire probably has two JIDs (to and from). If we have to perform a slow heap allocation several times for every single packet, we're going to be in trouble when designing a realtime system (maybe it doesn't matter so much for chat, but imagine a video signaling protocol, or a stock ticker). However, if we have the option of, say, using memory from a pool that's already allocated, or transforming the JID in place, things are a lot faster.
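As a rough illustration of reusing memory that is already allocated, the sketch below keeps one scratch buffer (say, per connection) and canonicalizes every incoming address into it. `Scratch` and `canonicalize` are hypothetical names, and ASCII lowercasing is only a placeholder for the real normalization a JID needs:

```rust
/// Hypothetical reusable scratch space, e.g. one per connection.
struct Scratch {
    buf: Vec<u8>,
}

impl Scratch {
    /// Allocates once, up front.
    fn with_capacity(cap: usize) -> Self {
        Scratch { buf: Vec::with_capacity(cap) }
    }

    /// Copies `input` into the reused buffer and canonicalizes it there.
    /// ASCII lowercasing is a placeholder for real normalization.
    fn canonicalize(&mut self, input: &str) -> &str {
        // `clear` drops the old contents but keeps the existing capacity,
        // so no allocation happens while inputs fit in the buffer.
        self.buf.clear();
        self.buf.extend_from_slice(input.as_bytes());
        self.buf.make_ascii_lowercase();
        // ASCII case changes never break UTF-8 validity.
        std::str::from_utf8(&self.buf).expect("still valid UTF-8")
    }
}
```

After the first allocation, each call only touches the existing buffer as long as the input fits in its capacity.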

Jun 27 '17 17:06 SamWhited

Right. I see what you mean.

There's one goal I already have for the string-level API, which is to implement the isX family of functions, especially a fast isNFC(), which is usually a requirement at API boundaries (to accept a message, or to reject it, run toNFC(), etc. on failure). So, that would be an allocation-free verification method, which is a pretty common need. (https://github.com/behnam/rust-unic/issues/20)
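A small sketch of that check-first pattern, under stated assumptions: `is_canonical` and `to_canonical` are hypothetical names, and "contains no ASCII uppercase" is only a placeholder predicate standing in for a real isNFC(). The check scans without allocating, and the conversion allocates only when the check fails:

```rust
use std::borrow::Cow;

/// Placeholder predicate standing in for a real `is_nfc`:
/// scans the input once and never allocates.
fn is_canonical(s: &str) -> bool {
    !s.bytes().any(|b| b.is_ascii_uppercase())
}

/// Borrows the input when it is already canonical and allocates only
/// when a change is actually required.
fn to_canonical(s: &str) -> Cow<'_, str> {
    if is_canonical(s) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_ascii_lowercase())
    }
}
```

An API boundary could call `is_canonical` alone to accept or reject input, or `to_canonical` to repair it, paying for an allocation only on the slow path.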

On the other hand, ICU has a transformation pipelining system, which can create a complex transformation (with UTF-switches, even, IIRC) from a simple string definition: http://userguide.icu-project.org/transforms

That's an interesting part which I think can benefit from the macro system for pipeline creation.

Jun 27 '17 22:06 behnam

There's one goal I already have for the string-level API, which is to implement the isX family of functions

I like that idea too; the Go version uses the concept of a "spanning transformer" for that (a transformer that also has the ability to report the longest span of text that will require no transformation). This means that helper functions that use the transformers can sometimes move much faster by searching for spans first, copying bytes that don't need to be transformed, and then only running the transformation on slices that actually need to be transformed (and "isTransformed" helper functions can just use the spanning transformer under the hood and check that the span that's reported is the length of the original string).
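Roughly, the spanning idea could look like the sketch below. This is not the Go API; the function names are made up for illustration, and ASCII lowercasing again stands in for a real transform. `span` reports the longest prefix that needs no work, and the helpers build on it:

```rust
/// Length of the longest prefix of `src` that is already in the target form.
fn span(src: &[u8]) -> usize {
    src.iter()
        .position(|b| b.is_ascii_uppercase())
        .unwrap_or(src.len())
}

/// The "isTransformed" helper: the span covers the whole input.
fn is_transformed(src: &[u8]) -> bool {
    span(src) == src.len()
}

/// Copies the already-transformed prefix verbatim, then transforms the rest.
/// Returns the number of bytes written, or `None` if `dst` is too small.
fn transform_into(dst: &mut [u8], src: &[u8]) -> Option<usize> {
    let out = dst.get_mut(..src.len())?;
    let n = span(src);
    // Fast path: plain copy for bytes that need no transformation.
    out[..n].copy_from_slice(&src[..n]);
    // Slow path: only the tail is run through the transform.
    for (d, s) in out[n..].iter_mut().zip(&src[n..]) {
        *d = s.to_ascii_lowercase();
    }
    Some(src.len())
}
```

A real implementation would probably alternate between spanning and transforming across the whole input rather than only skipping the leading span.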

On the other hand, ICU has a Transformation pipelining system

I'm not actually all that familiar with ICU, but this sounds similar to how the Go text libraries work.

That's an interesting part which I think can benefit from the macro system for pipeline creation.

I'm not familiar with this; sounds interesting.

Jun 28 '17 18:06 SamWhited

Just to spitball an API design, I've been playing with something like this (mostly stolen and adapted from the Go x/text API by @mpvl):

pub trait Transformer {
    fn transform(
        &mut self, 
        dst: &mut [u8], 
        src: &[u8], 
        eof: bool
    ) -> Result<(usize, usize)>;

    // Might want to make reset part of a separate trait?
    // Then we can tell the difference between stateful transformers (that
    // are actually a ResettableTransformer or something) and stateless
    // transforms which don't need a reset method.
    fn reset(&mut self);

    // Maybe have a few default impl helper methods for wrapping readers/writers,
    // chaining transforms, etc; body elided.
    fn chain<T: Transformer>(self, next: T) -> Chain<Self, T>
    where
        Self: Sized,
    { ... }
    fn reader<T: io::Read>(read: T) -> TransformRead<T> { ... }
    fn writer<T: io::Write>(write: T) -> TransformWrite<T> { ... }
}

// It makes implementing transformers harder, but also can make lots of operations
// faster: maybe this should just be part of the regular Transformer trait and required.
pub trait SpanningTransformer: Transformer {
    fn span(&self, src: &[u8], eof: bool) -> Result<usize>;
}

So, e.g., a norm::NFC struct might implement Transformer, which would also give you a way to get readers/writers that wrap existing readers/writers, except that they perform NFC normalization on any bytes passing through them.
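To show how a trait of that shape might be implemented and driven, here is a self-contained toy. Everything in it is an assumption made for illustration: the trait is cut down to `transform` and `reset`, `TransformError`, `AsciiUpper`, and `run` are invented names, and ASCII uppercasing stands in for something like NFC. The driver uses a fixed-size stack buffer, so the transform itself never allocates:

```rust
/// Hypothetical error type; a real design would likely need more variants
/// (for example "destination too short" and "source too short").
#[derive(Debug, PartialEq)]
enum TransformError {
    NonAscii,
}

/// Cut-down version of the trait sketched above.
trait Transformer {
    /// Returns (bytes written to `dst`, bytes consumed from `src`).
    fn transform(
        &mut self,
        dst: &mut [u8],
        src: &[u8],
        eof: bool,
    ) -> Result<(usize, usize), TransformError>;

    fn reset(&mut self);
}

/// Toy stateless transformer: uppercases ASCII and rejects anything else.
struct AsciiUpper;

impl Transformer for AsciiUpper {
    fn transform(
        &mut self,
        dst: &mut [u8],
        src: &[u8],
        _eof: bool,
    ) -> Result<(usize, usize), TransformError> {
        // Process only as much as fits; the caller loops for the rest.
        let n = dst.len().min(src.len());
        for (d, s) in dst[..n].iter_mut().zip(&src[..n]) {
            if !s.is_ascii() {
                return Err(TransformError::NonAscii);
            }
            *d = s.to_ascii_uppercase();
        }
        Ok((n, n))
    }

    fn reset(&mut self) {} // nothing to reset: no internal state
}

/// Drives a transformer to completion through a fixed stack buffer,
/// handing each transformed chunk to `sink`.
fn run<T: Transformer>(
    t: &mut T,
    mut src: &[u8],
    mut sink: impl FnMut(&[u8]),
) -> Result<(), TransformError> {
    let mut buf = [0u8; 8];
    while !src.is_empty() {
        // `src` always holds all remaining input, so `eof` is true.
        let (n_dst, n_src) = t.transform(&mut buf, src, true)?;
        sink(&buf[..n_dst]);
        src = &src[n_src..];
    }
    Ok(())
}
```

One open design question this surfaces: with a `Result<(usize, usize)>` return type, the progress counts are lost when an error is returned, whereas the Go API returns the counts alongside the error.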

Jun 28 '17 18:06 SamWhited
