DataProvider routing, lazy deserialization, caching, and overlays
I wanted to put together an updated, comprehensive model of how different types of data providers interact with one another.
I. Routing
A "routing data provider" or "data router" is one that sends a data request to one or more downstream data providers.
Multi-Blob Data Provider
The multi-blob data provider (#1107) is a specific case. Its data model can be a set of ZeroMaps, and perhaps some metadata to help know which ZeroMap to query for a particular key and locale.
struct MultiBlobDataProvider {
    blobs: Vec<Yoke<ZeroMap<str, [u8]>, Rc<[u8]>>>,
    metadata: // optional
}
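For illustration, a lookup might consult the metadata to choose a blob and then query that blob's ZeroMap. This is only a sketch; it assumes the metadata can map a request to a blob index, and the helpers blob_index_for and zeromap_key_for are hypothetical:

impl MultiBlobDataProvider {
    // Sketch only: pick a blob via the (optional) metadata, then query its ZeroMap.
    fn lookup(&self, req: &DataRequest) -> Option<&[u8]> {
        // `blob_index_for` and `zeromap_key_for` are illustrative helpers, not real APIs.
        let index = self.metadata.blob_index_for(req)?;
        let blob = self.blobs.get(index)?;
        blob.get().get(&zeromap_key_for(req))
    }
}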
General-Purpose Data Router
The more general case requires using dyn Any as an intermediate. We already have ErasedDataStruct for this purpose. Note that this ErasedDataStruct lives in a different module and serves a different purpose than the one that uses erased_serde.
struct DataRouter {
    providers: Vec<Box<dyn DataProvider<ErasedDataStruct>>>,
}

impl DataProvider<ErasedDataStruct> for DataRouter { /* ... */ }
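To make the routing behavior concrete, the /* ... */ body could try each downstream provider in order and return the first successful response. A minimal sketch, with lifetime parameters elided and the error handling simplified (a real design would need to distinguish "data not found" from hard errors):

impl DataProvider<ErasedDataStruct> for DataRouter {
    fn load_payload(
        &self,
        req: &DataRequest,
    ) -> Result<DataResponse<ErasedDataStruct>, DataError> {
        let mut last_error = None;
        // Ask each downstream provider in order; the first success wins.
        for provider in self.providers.iter() {
            match provider.load_payload(req) {
                Ok(response) => return Ok(response),
                // Simplification: remember the error and fall through to the next provider.
                Err(e) => last_error = Some(e),
            }
        }
        // All providers failed; surface the last error (panics if the list is empty).
        Err(last_error.expect("DataRouter has no downstream providers"))
    }
}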
In order to convert from ErasedDataStruct to a concrete type, we need lazy deserialization.
II. Lazy Deserialization
In #837, I suggest making a data provider that converts from u8 buffers to concrete data structs. Something like:
struct DataDeserializer<P: DataProvider<BufferMarker>> {
    provider: P,
}

impl<M, P: DataProvider<BufferMarker>> DataProvider<M> for DataDeserializer<P>
where
    M::Yokeable::Output: Deserialize,
{
    // ...
}
where BufferMarker is a data struct that has not been parsed yet. BlobDataProvider, FsDataProvider, MultiBlobDataProvider, etc., would all produce BufferMarker.
To go one step further, DataDeserializer could work on ErasedDataStruct as well. It would first attempt to downcast the data struct to the concrete type; if that fails, it would then attempt to downcast to a BufferMarker and deserialize it. (It is unexpected for both downcasts to fail; in such a case, we would return an error result.)
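As an illustration of that fallback, the deserializer's inner logic might look roughly like this. It assumes a downcast that returns the original payload on failure (similar to Box<dyn Any>::downcast); the method names and the deserialize_buffer helper are assumptions for the sketch, not settled API:

fn resolve_erased<M>(payload: DataPayload<ErasedDataStruct>) -> Result<DataPayload<M>, DataError>
where
    M: DataMarker, // plus the Deserialize bound sketched above
{
    // Step 1: the payload may already hold the concrete type (e.g. from a struct provider).
    match payload.downcast::<M>() {
        Ok(concrete) => Ok(concrete),
        // Step 2: otherwise it should be an unparsed buffer; deserialize it.
        Err(still_erased) => {
            let buffer = still_erased.downcast::<BufferMarker>()?;
            deserialize_buffer::<M>(buffer) // illustrative helper over the enabled formats
        }
    }
}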
Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc.) that a DataDeserializer can operate on? The code we currently use is here, where we essentially have Cargo features that turn the different deserializers on or off. We want to avoid using erased_serde in the general case because of the impact on code size that we discovered. The Cargo feature approach might be the best option for now, because apps should hopefully know at compile time which deserializers they need. We could add an erased_serde option later for apps that don't care as much about code size but want to dynamically load new deserializers at runtime.
III. Caching
The rule of thumb is that there is no such thing as a one-size-fits-all caching solution. Clients have different use cases and resource constraints, which may favor heavy caching, light caching, or no caching at all.
A basic cache would look something like this:
struct LruDataCache<P: DataProvider<ErasedDataStruct>> {
    provider: P,
    data: LruCache<DataRequest, DataResponse<ErasedDataStruct>>,
}

impl<P: DataProvider<ErasedDataStruct>> LruDataCache<P> {
    fn new(max_size: usize, provider: P) -> Self { /* ... */ }
}
Note that we load from a DataProvider but cache a DataResponse.
Depending on whether the cache is inserted before or after the deserializer, the cache could track raw buffers or resolved data. In general, the intent would be that the cache is inserted after the deserializer, such that we keep track of resolved data structs that the app has previously requested.
Open Question: The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe. The alternative would be to make DataProvider work on mutable references instead of shared references.
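To sketch the mutex-like option, a variation of the LruDataCache above could keep the cache behind a Mutex so it can be updated through &self. This assumes an LruCache with get/put methods and that DataRequest and DataResponse can be cloned cheaply; both are assumptions, not decisions:

use std::sync::Mutex;

struct LruDataCache<P: DataProvider<ErasedDataStruct>> {
    provider: P,
    // Interior mutability: the cache can be updated through a shared reference.
    data: Mutex<LruCache<DataRequest, DataResponse<ErasedDataStruct>>>,
}

impl<P: DataProvider<ErasedDataStruct>> DataProvider<ErasedDataStruct> for LruDataCache<P> {
    fn load_payload(
        &self,
        req: &DataRequest,
    ) -> Result<DataResponse<ErasedDataStruct>, DataError> {
        // Fast path: return a clone of a previously cached response.
        if let Some(cached) = self.data.lock().unwrap().get(req) {
            return Ok(cached.clone());
        }
        // Slow path: load from the wrapped provider, then populate the cache.
        let response = self.provider.load_payload(req)?;
        self.data.lock().unwrap().put(req.clone(), response.clone());
        Ok(response)
    }
}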
IV. Overlays
One of the main use cases for chained data providers has been the idea of data overlays.
Until we have specialization, data overlays probably need to operate through the dyn Any code path like caches and general-purpose routers. A data overlay would likely take the following form:
struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}

impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        let mut res = self.provider.load_payload(req);
        if (/* data overlay conditional */) {
            let mut payload: DataPayload<ConcreteType> = res.payload.downcast()?;
            // mutate the payload as desired
            // question: is there such a thing as downcast_mut()?
            res.payload = payload.upcast();
        }
        res
    }
}
Seeking feedback from:
- [x] @zbraniecki
- [x] @Manishearth
I like this overall plan.
Open Question: How should we configure the deserializers (JSON, Bincode, Postcard, etc) that a DataDeserializer can operate on?
I think a Cargo feature is the right call here.
The caching data provider needs to mutate itself, but the DataProvider trait works on shared references. I think we should use a mutex-like abstraction to make this thread-safe
The standard pattern in Rust for caches is interior mutability (mutex or refcell). We can use things like Weak/Rc as well to build caches.
That looks good!
I think we should use a mutex-like abstraction to make this thread-safe
I'd go for Mutex.
I'm a bit concerned about your snippet at the end with overlays. The design you propose requires loading the payload, modifying it, and then returning it.
This differs from what I see as the most important use case, which I'd show as:
struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}

impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        if (/* data overlay conditional */) {
            load_local_payload(req)
        } else {
            self.provider.load_payload(req)
        }
    }
}
and:
struct MyDataOverlay<P: DataProvider<ErasedDataStruct>> {
    provider: P,
}

impl<P> DataProvider<ErasedDataStruct> for MyDataOverlay<P> {
    fn load_payload(&self, req) -> DataResponse<ErasedDataStruct> {
        let mut res = load_local_payload(req);
        if (!res.contains(something)) {
            res.extend_with(self.provider.load_payload(req));
        }
        res
    }
}
#1369 implements much of the infrastructure for this design to work.
I consider the remaining deliverable for this issue to be tests/examples for the remaining constructions in the OP.
Given CrabBake and the fact that the erased data provider needs a more prominent role, and based on further experience with FFI, here is my updated trait structure.
BufferProvider
A data provider that provides blobs.
Function Signature: fn load_buffer(req: &DataRequest) -> Result<DataResponse<BufferMarker>>
Features:
- FFI friendly (trait object safe)
- Supports deserialization and reading data from a broad spectrum of data sources
Status: Implemented.
AnyProvider
A data provider that provides Rust objects in memory as dyn Any trait objects.
Function Signature: fn load_any(req: &DataRequest) -> Result<AnyResponse>
Features:
- FFI friendly (trait object safe)
- Supports CrabBake, StructProvider, and InvariantDataProvider/UndProvider
Status: Tracked by #1479 and #1494
KeyProvider<M>
A data provider that provides Rust objects for specific data keys.
Function Signature: fn load_key(options: &ResourceOptions) -> Result<DataResponse<M>>
Features:
- DataMarker is in the trait signature
- Works for data transformers
- Can be put in sequence with an AnyProvider for override support
Status: Depends on #570
DataProvider
The core data provider trait that supports all data keys.
Function Signature: fn load_payload<M>(req: &DataRequest) -> Result<DataResponse<M>>
Features:
- This is the trait taken by all try_new constructors in Rust
- Auto-implemented on BufferProvider and AnyProvider (or on a wrapper struct); see the sketch after this list
- Supports the caching (lazy deserialization) use case
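As a sketch of the wrapper-struct idea for BufferProvider: the wrapper owns a trait object and forwards requests through the deserialization path. The wrapper name, the bound, and the deserialize_response helper are illustrative, not the final API; AnyProvider would get an analogous wrapper that downcasts instead of deserializing.

// Illustrative wrapper that turns any BufferProvider into a DataProvider<M>
// by deserializing the returned buffers.
struct DeserializingBufferProvider<'a>(&'a dyn BufferProvider);

impl<'a, M> DataProvider<M> for DeserializingBufferProvider<'a>
where
    M: DataMarker, // plus a Deserialize bound on M's data struct, as in section II
{
    fn load_payload(&self, req: &DataRequest) -> Result<DataResponse<M>, DataError> {
        // 1. Ask the buffer provider for raw bytes.
        let buffer_response = self.0.load_buffer(req)?;
        // 2. Deserialize the bytes into M's data struct using the enabled formats.
        deserialize_response::<M>(buffer_response) // illustrative helper
    }
}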
To-do: make sure everything here is well documented.
Document the following in the data provider tutorial:
- Design is such that caching is not needed, but could be added based on specific client needs
- Examples of how to do overlays/overrides