
DataProvider routing, lazy deserialization, caching, and overlays

Open sffc opened this issue 4 years ago • 5 comments

I wanted to put together an updated, comprehensive model of how different types of data providers interact with one another.

I. Routing

A "routing data provider" or "data router" is one that sends a data request to one or more downstream data providers.

Multi-Blob Data Provider

The multi-blob data provider (#1107) is a specific case. Its data model can be a set of ZeroMaps, plus optional metadata indicating which ZeroMap to query for a particular key and locale.

struct MultiBlobDataProvider {
    blobs: Vec<Yoke<ZeroMap<str, [u8]>, Rc<[u8]>>>,
    // plus optional metadata for selecting which blob to query
}
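
As a sketch of how lookup could route across blobs (ZeroMap and Yoke simplified to a plain BTreeMap of owned bytes; all names here are illustrative, not the actual ICU4X API):

```rust
use std::collections::BTreeMap;

// Hypothetical simplification: each "blob" is a map from a resource path to
// serialized bytes. ZeroMap and Yoke are elided; names are illustrative only.
struct MultiBlobProvider {
    blobs: Vec<BTreeMap<String, Vec<u8>>>,
}

impl MultiBlobProvider {
    /// Query each blob in order; the first one containing the key wins.
    /// Real metadata could instead pick the right blob without a scan.
    fn load(&self, key: &str) -> Option<&[u8]> {
        self.blobs
            .iter()
            .find_map(|blob| blob.get(key).map(|bytes| bytes.as_slice()))
    }
}
```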

General-Purpose Data Router

The more general case requires using dyn Any as an intermediate. We already have ErasedDataStruct for this purpose. Note that ErasedDataStruct is a different module, with a different purpose, from the one that uses erased_serde.

struct DataRouter {
    providers: Vec<Box<dyn DataProvider<ErasedDataStruct>>>
}
impl DataProvider<ErasedDataStruct> for DataRouter { /* ... */ }

In order to convert from ErasedDataStruct to a concrete type, we need lazy deserialization.
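
The routing logic itself could be as simple as forwarding the request to each downstream provider until one succeeds. A minimal standalone sketch, with hypothetical stand-ins for DataRequest, DataProvider, and ErasedDataStruct:

```rust
use std::any::Any;

// Hypothetical stand-ins, just to sketch the routing logic.
type DataRequest = String; // stand-in: a resource key plus locale
type ErasedPayload = Box<dyn Any>;

trait ErasedProvider {
    fn load(&self, req: &DataRequest) -> Result<ErasedPayload, String>;
}

struct DataRouter {
    providers: Vec<Box<dyn ErasedProvider>>,
}

impl ErasedProvider for DataRouter {
    fn load(&self, req: &DataRequest) -> Result<ErasedPayload, String> {
        // Forward the request downstream; the first success wins.
        for provider in &self.providers {
            if let Ok(payload) = provider.load(req) {
                return Ok(payload);
            }
        }
        Err(format!("no provider could fulfill {req}"))
    }
}

// Example downstream provider that serves exactly one key.
struct OneKey(&'static str, i32);
impl ErasedProvider for OneKey {
    fn load(&self, req: &DataRequest) -> Result<ErasedPayload, String> {
        if req.as_str() == self.0 {
            Ok(Box::new(self.1))
        } else {
            Err("miss".to_string())
        }
    }
}
```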

II. Lazy Deserialization

In #837, I suggest making a data provider that converts from u8 buffers to concrete data structs. Something like:

struct DataDeserializer<P: DataProvider<BufferMarker>> {
    provider: P
}
impl<M: DataMarker, P: DataProvider<BufferMarker>> DataProvider<M> for DataDeserializer<P>
where
    // sketch: the marker's yokeable output must be deserializable
    for<'de> <M::Yokeable as Yokeable<'de>>::Output: Deserialize<'de>,
{
    // ...
}

where BufferMarker is a data struct that has not been parsed yet. BlobDataProvider, FsDataProvider, MultiBlobDataProvider, etc., would all produce BufferMarker.

To go one step further, DataDeserializer could work on ErasedDataStruct as well. It would first attempt to downcast the data struct to the concrete type; if that fails, it would then attempt to downcast to a BufferMarker and deserialize it. (It is unexpected for both downcasts to fail; in such a case, we would return an error.)
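
The two-step downcast can be sketched in miniature like this, with ErasedDataStruct stood in by Box<dyn Any>, BufferMarker by Vec<u8>, and serde replaced by a caller-supplied parse function (all hypothetical names):

```rust
use std::any::Any;

// Hypothetical sketch: try the concrete type first, then fall back to
// treating the payload as an unparsed buffer and deserializing it.
fn resolve<T: 'static>(
    erased: Box<dyn Any>,
    parse: impl Fn(&[u8]) -> Result<T, String>,
) -> Result<T, String> {
    // Step 1: the payload may already be the concrete type.
    let erased = match erased.downcast::<T>() {
        Ok(concrete) => return Ok(*concrete),
        Err(original) => original,
    };
    // Step 2: otherwise it should be an unparsed buffer; deserialize it.
    match erased.downcast::<Vec<u8>>() {
        Ok(buffer) => parse(&buffer),
        // Both downcasts failing is unexpected; surface an error.
        Err(_) => Err("neither concrete type nor buffer".to_string()),
    }
}
```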

Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc.) that a DataDeserializer can operate on? The code we currently use essentially has Cargo features that turn the different deserializers on or off. We want to avoid using erased_serde in the general case because of the impact on code size that we discovered. The Cargo feature might be the best option for now, because apps should hopefully know at compile time which deserializers they need. We could later add an erased_serde option for apps that care less about code size but want to load new deserializers dynamically at runtime.
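
One way the feature-gated approach could look (feature names and the parse helpers here are hypothetical; arms whose feature is disabled compile away entirely, which is where the code-size win comes from):

```rust
// Sketch of the Cargo-feature approach: each deserializer arm only exists
// when the corresponding feature is enabled. Feature names and the
// `parse_json`/`parse_postcard` helpers are hypothetical.
#[derive(Debug)]
enum SyntaxFormat {
    Json,
    Postcard,
}

fn deserialize_buffer(format: &SyntaxFormat, _bytes: &[u8]) -> Result<String, String> {
    match format {
        #[cfg(feature = "deserialize_json")]
        SyntaxFormat::Json => parse_json(_bytes),
        #[cfg(feature = "deserialize_postcard")]
        SyntaxFormat::Postcard => parse_postcard(_bytes),
        // Any format whose feature was not enabled at compile time is an error.
        other => Err(format!("deserializer for {other:?} not compiled in")),
    }
}
```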

III. Caching

The rule of thumb is that there is no such thing as a one-size-fits-all caching solution. Clients have different use cases and resource constraints, which may favor heavy caching, light caching, or no caching at all.

A basic cache would look something like this:

struct LruDataCache<P: DataProvider<ErasedDataStruct>> {
    provider: P,
    data: LruCache<DataRequest, DataResponse<ErasedDataStruct>>
}
impl<P: DataProvider<ErasedDataStruct>> LruDataCache<P> {
    fn new(max_size: usize, provider: P) -> Self { /* ... */ }
}

Note that we load from a DataProvider but cache a DataResponse.

Depending on whether the cache is inserted before or after the deserializer, the cache could track raw buffers or resolved data. In general, the intent would be that the cache is inserted after the deserializer, such that we keep track of resolved data structs that the app has previously requested.
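
To make the stacking order concrete, here is a single-threaded sketch (RefCell instead of a mutex, a String standing in for the resolved data struct, a plain HashMap instead of an LRU; all names illustrative) of a cache layered after the deserializer, so the parse step runs at most once per key:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

trait Provider {
    fn load(&self, key: &str) -> String; // "resolved data struct" stand-in
}

struct Deserializer {
    // Counts how many times we actually deserialize, to show the cache works.
    calls: RefCell<u32>,
}
impl Provider for Deserializer {
    fn load(&self, key: &str) -> String {
        *self.calls.borrow_mut() += 1;
        format!("parsed:{key}")
    }
}

struct Cache<P: Provider> {
    inner: P,
    map: RefCell<HashMap<String, String>>,
}
impl<P: Provider> Provider for Cache<P> {
    fn load(&self, key: &str) -> String {
        // Serve resolved data from the cache when possible.
        if let Some(hit) = self.map.borrow().get(key) {
            return hit.clone();
        }
        let value = self.inner.load(key);
        self.map.borrow_mut().insert(key.to_string(), value.clone());
        value
    }
}
```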

Open Question: The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe. The alternative would be to make DataProvider work on mutable references instead of shared references.

IV. Overlays

One of the main use cases for chained data providers has been the idea of data overlays.

Until we have specialization, data overlays probably need to operate through the dyn Any code path like caches and general-purpose routers. A data overlay would likely take the following form:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P: DataProvider<ErasedDataStruct>> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req: &DataRequest) -> Result<DataResponse<ErasedDataStruct>> {
        let mut res = self.provider.load_payload(req)?;
        if /* data overlay conditional */ {
            let mut payload: DataPayload<ConcreteType> = res.payload.downcast()?;
            // mutate the payload as desired
            // question: is there such a thing as downcast_mut()?
            res.payload = payload.upcast();
        }
        Ok(res)
    }
}

Seeking feedback from:

  • [x] @zbraniecki
  • [x] @Manishearth

sffc avatar Oct 31 '21 18:10 sffc

I like this overall plan.

Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc) that a DataDeserializer can operate on?

I think cargo feature is the right call here.

The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe

The standard pattern in Rust for caches is interior mutability (mutex or refcell). We can use things like Weak/Rc as well to build caches.
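
For example, a minimal sketch of that interior-mutability pattern with a Mutex (illustrative names, a plain HashMap rather than an LRU): the load method takes `&self` yet still mutates the cache, and the provider can be shared across threads.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Sketch: the trait-style method takes `&self`, and all mutation happens
// through the Mutex. Names here are illustrative, not the ICU4X API.
struct CachingProvider {
    cache: Mutex<HashMap<String, String>>,
}

impl CachingProvider {
    fn load(&self, key: &str) -> String {
        let mut cache = self.cache.lock().unwrap();
        // Insert on miss, then return the cached value either way.
        cache
            .entry(key.to_string())
            .or_insert_with(|| format!("loaded:{key}"))
            .clone()
    }
}
```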

Manishearth avatar Nov 08 '21 22:11 Manishearth

That looks good!

I think we should use a mutex-like abstraction to make this thread-safe

I'd go for Mutex.

I'm a bit concerned about your snippet at the end with overlays. The design you propose requires loading the payload, modifying it, and then returning it.

This differs from what I see as the most important use case, which I'd show as:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P: DataProvider<ErasedDataStruct>> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req: &DataRequest) -> Result<DataResponse<ErasedDataStruct>> {
        if /* data overlay conditional */ {
            load_local_payload(req)
        } else {
            self.provider.load_payload(req)
        }
    }
}

and:

struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}
impl<P: DataProvider<ErasedDataStruct>> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req: &DataRequest) -> Result<DataResponse<ErasedDataStruct>> {
        let mut res = load_local_payload(req)?;
        if !res.contains(something) {
            res.extend_with(self.provider.load_payload(req)?);
        }
        Ok(res)
    }
}

zbraniecki avatar Nov 08 '21 22:11 zbraniecki

#1369 implements much of the infrastructure for this design to work.

I consider the remaining deliverable for this issue to be tests/examples for the remaining constructions in the OP.

sffc avatar Dec 09 '21 17:12 sffc

Given CrabBake and the fact that the erased data provider needs a more prominent role, and based on further experience with FFI, here is my updated trait structure.

BufferProvider

A data provider that provides blobs.

Function Signature: fn load_buffer(req: &DataRequest) -> Result<DataResponse<BufferMarker>>

Features:

  • FFI friendly (trait object safe)
  • Supports deserialization and reading data from a broad spectrum of data sources

Status: Implemented.

AnyProvider

A data provider that provides Rust objects in memory as dyn Any trait objects.

Function Signature: fn load_any(req: &DataRequest) -> Result<AnyResponse>

Features:

  • FFI friendly (trait object safe)
  • Supports CrabBake, StructProvider, and InvariantDataProvider/UndProvider

Status: Tracked by #1479 and #1494

KeyProvider<M>

A data provider that provides Rust objects for specific data keys.

Function Signature: fn load_key(options: &ResourceOptions) -> Result<DataResponse<M>>

Features:

  • DataMarker is in the trait signature
  • Works for data transformers
  • Can be put in sequence with an AnyProvider for override support

Status: Depends on #570

DataProvider

The core data provider trait that supports all data keys.

Function Signature: fn load_payload<M>(req: &DataRequest) -> Result<DataResponse<M>>

Features:

  • This is the trait taken by all try_new constructors in Rust
  • Auto-implemented on BufferProvider and AnyProvider (or on a wrapper struct)
  • Supports the caching (lazy deserialization) use case
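
The "auto-implemented ... (or on a wrapper struct)" bullet can be sketched in miniature: a wrapper adapts a byte-level provider into a typed one via a blanket impl. All trait and type names below are illustrative stand-ins, not the actual ICU4X API, and the serde bound is replaced by a hypothetical FromBytes trait.

```rust
// Byte-level provider, analogous to BufferProvider.
trait ByteProvider {
    fn load_buffer(&self, key: &str) -> Result<Vec<u8>, String>;
}

// Typed provider, analogous to DataProvider with a marker type.
trait TypedProvider<T> {
    fn load_payload(&self, key: &str) -> Result<T, String>;
}

/// Stand-in for a serde-style bound: types that can parse themselves.
trait FromBytes: Sized {
    fn from_bytes(bytes: &[u8]) -> Result<Self, String>;
}

struct Deserializing<P>(P);

// Blanket impl: wrapping any ByteProvider yields a TypedProvider
// for every type that implements the parsing bound.
impl<P: ByteProvider, T: FromBytes> TypedProvider<T> for Deserializing<P> {
    fn load_payload(&self, key: &str) -> Result<T, String> {
        let bytes = self.0.load_buffer(key)?;
        T::from_bytes(&bytes)
    }
}

// Example byte-level provider that always returns the same two bytes.
struct StaticBlob;
impl ByteProvider for StaticBlob {
    fn load_buffer(&self, _key: &str) -> Result<Vec<u8>, String> {
        Ok(vec![4, 2])
    }
}

// Example "data struct": a Vec<u8> that parses as a copy of the bytes.
impl FromBytes for Vec<u8> {
    fn from_bytes(bytes: &[u8]) -> Result<Self, String> {
        Ok(bytes.to_vec())
    }
}
```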

sffc avatar Jan 11 '22 20:01 sffc

To-do: make sure everything here is well documented.

sffc avatar Jul 28 '22 17:07 sffc

Document the following in the data provider tutorial:

  1. Design is such that caching is not needed, but could be added based on specific client needs
  2. Examples of how to do overlays/overrides

sffc avatar Sep 26 '22 18:09 sffc