geojson
geojson copied to clipboard
Improve serialization memory footprint
Currently, we convert Geojson
to a Map<String, serde_json::Value>
to serialize. This seems to be quite wasteful with memory (and probably also compute-time). For instance, when serializing a large set of geoms, the serialization alone takes up more than 10x the memory required to store the final representation.
I suggest we move out of this approach, in favour of a direct serialization via derive macros / custom impl.. From a quick skim, it looks like only Value
requires a custom impl, and the rest can be done via serde_derive (thanks to the flexibility available therein).
Once we move to this, we could also get rid of the JsonObject
and all related TryFrom
/ From
impls. Note that serde_json
already provides {to,from}_value
that can convert any serializable type to/from JsonValue
. Is there any additional value (no pun intended) in keeping the impls.?
One minor concern: the docs suggest that serde is optional, but to the best of my understanding, it is a very crucial part of the crate. Am I missing something?
(Issue summarized from discord discussion.)
I can't picture how this looks, but this sounds like a solid approach and I'm happy to help once I know what to do (and pending the resolution of the serde question, also mentioned in #172
@urschrei The idea is to use the serde_derive
macros and adjust the #[serde()]
attributes on the data-structures to align the output with the specs.
So, for instance, the main enum GeoJson
would now use the internally-tagged repr.:
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum GeoJson {
Geometry(Geometry),
Feature(Feature),
FeatureCollection(FeatureCollection),
}
The FeatureCollection
would look something like (haven't checked foreign members and bbox yet):
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct FeatureCollection {
/// Bounding Box
///
/// [GeoJSON Format Specification § 5](https://tools.ietf.org/html/rfc7946#section-5)
#[serde(skip_serializing_if = "Option::is_none")]
pub bbox: Option<Bbox>,
pub features: Vec<Feature>,
/// Foreign Members
///
/// [GeoJSON Format Specification § 6](https://tools.ietf.org/html/rfc7946#section-6)
#[serde(skip_serializing_if = "Option::is_none", flatten)]
pub foreign_members: Option<JsonObject>,
}
These two already fix the memory issues (which is mainly caused by large FeatureCollection
s)
So, error handling of serde_json
(and may be in general in serde
) is a bit inflexible than I hoped. Custom errors are reported in impl Serialize, Deserialize
only via conversion to a string (see here). This means, we lose the ability to keep our Error
enum for semantic errors (like Feature
is missing "type", etc.); we can only pass the error message as a string.
I had hoped it would allow Box<dyn Something>
that can be down-casted back by the user to our error enum. This means, that we have to choose between our error enum, and a memory-efficient {de-,}serializer. :(
The serde_json
issue on memory usage also points to a memory-efficient Value alternative: ijson. We could just impl. conversions from and into this type so users can opt for this as an alternative. The crate does use some pretty crazy unsafe-code to make the type exactly pointer sized. It also does mean that we have to repeat all our conversion code twice unfortunately.
@rmanoka are you sure about that? I interpret https://serde.rs/error-handling.html as a "data format" can define it's own Error.
The data format can indeed define their own error type; here data format we intend to use is serde_json
(via the general serde
traits). However, what we want is to somehow use our existing error struct (i.e. geojson::Error
), through the serde
framework; this doesn't seem possible without conversion into strings and back.
@rmanoka I think I understand, so I guess the more "optimal" way would be to implement a more serde native serde_geojson
. But that is probably another project completely. However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.
However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.
I'm inclined to agree – in the case of GeoJSON handling, an order-of-magnitude increase is worth the cost of less informative error messages IMO.
The existing API's serializing a FeatureCollection directly is still expensive serde_json::to_string(FeatureCollection)
, but there are now alternatives like FeatureWriter
and FeatureReader
which should work for many use cases, while consuming much less memory.
see #199, #205, and #206