geojson icon indicating copy to clipboard operation
geojson copied to clipboard

Improve serialization memory footprint

Open rmanoka opened this issue 3 years ago • 8 comments

Currently, we convert Geojson to a Map<String, serde_json::Value> to serialize. This seems to be quite wasteful with memory (and probably also compute-time). For instance, when serializing a large set of geoms, the serialization alone takes up more than 10x the memory required to store the final representation.

I suggest we move out of this approach, in favour of a direct serialization via derive macros / custom impl.. From a quick skim, it looks like only Value requires a custom impl, and the rest can be done via serde_derive (thanks to the flexibility available therein).

Once we move to this, we could also get rid of the JsonObject and all related TryFrom / From impls. Note that serde_json already provides {to,from}_value that can convert any serializable type to/from JsonValue. Is there any additional value (no pun intended) in keeping the impls.?

One minor concern: the docs suggest that serde is optional, but to the best of my understanding, it is a very crucial part of the crate. Am I missing something?

(Issue summarized from discord discussion.)

rmanoka avatar Jan 17 '22 17:01 rmanoka

I can't picture how this looks, but this sounds like a solid approach and I'm happy to help once I know what to do (and pending the resolution of the serde question, also mentioned in #172

urschrei avatar Jan 18 '22 15:01 urschrei

@urschrei The idea is to use the serde_derive macros and adjust the #[serde()] attributes on the data-structures to align the output with the specs.

So, for instance, the main enum GeoJson would now use the internally-tagged repr.:

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum GeoJson {
    Geometry(Geometry),
    Feature(Feature),
    FeatureCollection(FeatureCollection),
}

The FeatureCollection would look something like (haven't checked foreign members and bbox yet):

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct FeatureCollection {
    /// Bounding Box
    ///
    /// [GeoJSON Format Specification § 5](https://tools.ietf.org/html/rfc7946#section-5)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub bbox: Option<Bbox>,
    pub features: Vec<Feature>,
    /// Foreign Members
    ///
    /// [GeoJSON Format Specification § 6](https://tools.ietf.org/html/rfc7946#section-6)
    #[serde(skip_serializing_if = "Option::is_none", flatten)]
    pub foreign_members: Option<JsonObject>,
}

These two already fix the memory issues (which is mainly caused by large FeatureCollections)

rmanoka avatar Jan 19 '22 07:01 rmanoka

So, error handling of serde_json (and may be in general in serde) is a bit inflexible than I hoped. Custom errors are reported in impl Serialize, Deserialize only via conversion to a string (see here). This means, we lose the ability to keep our Error enum for semantic errors (like Feature is missing "type", etc.); we can only pass the error message as a string.

I had hoped it would allow Box<dyn Something> that can be down-casted back by the user to our error enum. This means, that we have to choose between our error enum, and a memory-efficient {de-,}serializer. :(

The serde_json issue on memory usage also points to a memory-efficient Value alternative: ijson. We could just impl. conversions from and into this type so users can opt for this as an alternative. The crate does use some pretty crazy unsafe-code to make the type exactly pointer sized. It also does mean that we have to repeat all our conversion code twice unfortunately.

rmanoka avatar Jan 21 '22 17:01 rmanoka

@rmanoka are you sure about that? I interpret https://serde.rs/error-handling.html as a "data format" can define it's own Error.

bjornharrtell avatar Mar 19 '22 22:03 bjornharrtell

The data format can indeed define their own error type; here data format we intend to use is serde_json (via the general serde traits). However, what we want is to somehow use our existing error struct (i.e. geojson::Error), through the serde framework; this doesn't seem possible without conversion into strings and back.

rmanoka avatar Mar 20 '22 13:03 rmanoka

@rmanoka I think I understand, so I guess the more "optimal" way would be to implement a more serde native serde_geojson. But that is probably another project completely. However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.

bjornharrtell avatar Mar 22 '22 17:03 bjornharrtell

However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.

I'm inclined to agree – in the case of GeoJSON handling, an order-of-magnitude increase is worth the cost of less informative error messages IMO.

urschrei avatar Mar 22 '22 17:03 urschrei

The existing API's serializing a FeatureCollection directly is still expensive serde_json::to_string(FeatureCollection), but there are now alternatives like FeatureWriter and FeatureReader which should work for many use cases, while consuming much less memory.

see #199, #205, and #206

michaelkirk avatar Sep 02 '22 22:09 michaelkirk