geojson icon indicating copy to clipboard operation
geojson copied to clipboard

Deserialize geojson into custom struct using serde

Open michaelkirk opened this issue 2 years ago • 7 comments

  • [x] I agree to follow the project's code of conduct.
  • [x] I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

This is an attempt to improve the ergonomics of parsing GeoJson using serde. The API is somewhat inspired by my usage of the csv crate

This PR is a draft because there are a lot of error paths related to invalid parsing that I'd like to add tests for, but I first wanted to check in on overall direction of the API. What do people think?

Here's what it looks like...

Given some geojson like this:

let feature_collection_string = r#"{
    "type": "FeatureCollection",
    "features": [
        {
           "type": "Feature",
           "geometry": {
             "type": "Point",
             "coordinates": [125.6, 10.1]
           },
           "properties": {
             "name": "Dinagat Islands",
             "age": 123
           }
        },
        {
           "type": "Feature",
           "geometry": {
             "type": "Point",
             "coordinates": [2.3, 4.5]
           },
           "properties": {
             "name": "Neverland",
             "age": 456
           }
         }
   ]
}"#
.as_bytes();

let io_reader = std::io::BufReader::new(feature_collection_string);

You used to parse it like this:

use geojson:: FeatureIterator;
let feature_reader = FeatureIterator::new(io_reader);
for feature in feature_reader.features() {
    let feature = feature.expect("valid geojson feature");

    let name = feature.property("name").unwrap().as_str().unwrap();
    let age = feature.property("age").unwrap().as_u64().unwrap();
    let geometry = feature.geometry.value.try_into().unwrap();

    if name == "Dinagat Islands" {
        assert_eq!(123, age);
        assert_matches!(geometry, geo_types::Point::new(125.6, 10.1).into());
    } else if name == "Neverland" {
        assert_eq!(456, age);
        assert_matches!(geometry, geo_types::Point::new(2.3, 4.5).into());
    } else {
        panic!("unexpected name: {}", name);
    }
}

But now you also have the option of parsing it into your own custom struct using serde like this:

use geojson::deserialize::deserialize_geometry;
use geojson::FeatureReader;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct MyStruct {
    // You can parse directly to geo_types if you want, otherwise this will be a `geojson::Geometry`
    #[serde(deserialize_with = "deserialize_geometry")]
    geometry: geo_types::Geometry<f64>,
    name: String,
    age: u64,
}

let feature_reader = FeatureReader::from_reader(io_reader);
for feature in feature_reader.deserialize::<MyStruct>().unwrap() {
    let my_struct = feature.expect("valid geojson feature");

    if my_struct.name == "Dinagat Islands" {
        assert_eq!(my_struct.age, 123);
        assert_matches!(my_struct.geometry, geo_types::Point::new(125.6, 10.1).into());
    } else if my_struct.name == "Neverland" {
        assert_eq!(my_struct.age, 456);
        assert_matches!(my_struct.geometry, geo_types::Point::new(2.3, 4.5).into());
    } else {
        panic!("unexpected name: {}", my_struct.name);
    }
}

Related to https://github.com/georust/geojson/issues/184

michaelkirk avatar Aug 02 '22 01:08 michaelkirk

I like this a lot! The only possible feature I can think of at the moment is the ability to have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time (I could see that being tricky to implement).

I would have definitely liked something like this when I was doing geospatial work last summer.

mapoulos avatar Aug 10 '22 00:08 mapoulos

have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time

That's a great idea. I think I can add that pretty easily actually. Thanks for looking!

michaelkirk avatar Aug 10 '22 01:08 michaelkirk

have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time

I've pushed up another commit that does just this.

michaelkirk avatar Aug 10 '22 16:08 michaelkirk

I like the idea a lot. Do you have in mind to add serialization capabilities as well? I was wondering how a roundtrip would look like in regards to foreign members. Could they be preserved with the current approach to potentially achieve a lossless roundtrip?

b4l avatar Aug 10 '22 19:08 b4l

Thanks for looking @b4l!

Could [foreign members] be preserved with the current approach to potentially achieve a lossless roundtrip?

Not with the current approach. An example of the current behavior to highlight this:

{ 
   "type": "Feature",
    "foo": "bar",   // <-- foreign member "foo"
    properties: {
        "foo": 123  // <-- property "foo"
    }
}

The approach in this PR flattens properties into the top level of the struct, and simply ignores any foreign members:

MyStruct {
    geometry: Geometry,
    foo: u64
}

I can see why this behavior might not work for some use cases. The use case driving this PR is to have processing code that is agnostic (or requires only minimal changes) between formats. e.g. I'd like to have code that can read it's input from either GeoJSON or CSV (and some day maybe other formats as well) while only changing a couple lines of code.

Having multiple sources of inputs requires a sort of "lowest common denominator" of what can be accounted for, and foreign members fall outside of that.

If you did want to keep all the structural information, like foreign members, you can still do that same as always. You have to deserialize to a geojson::Feature or maybe even just a json::Object depending on your use cases.

It's also possible that if we threw enough proc_macro at the problem that we could come up with something more powerful, but I don't plan to work on that just yet.

michaelkirk avatar Aug 10 '22 20:08 michaelkirk

Do you have in mind to add serialization capabilities as well?

Oh sorry I forgot to answer this in my original reply. If this seems like a good overall direction, I'm happy to add serialization.

michaelkirk avatar Aug 10 '22 20:08 michaelkirk

Thanks for the explanations @michaelkirk !

My two potential use cases I have in mind are enforcing a certain schema with a concrete type/struct and support for https://github.com/opengeospatial/ogc-feat-geo-json.

This PR gives us the ability to enforce a concrete geometry type and a certain properties struct. I don't really know how important my use cases are, just wanted to mention them to raise awareness.

b4l avatar Aug 10 '22 21:08 b4l

This feature would be brilliant!

metasim avatar Aug 18 '22 19:08 metasim

This is indeed a fantastic idea: really flexible and ergonomic conversion to/from geojson! Thanks @michaelkirk

rmanoka avatar Aug 22 '22 14:08 rmanoka

Thanks for the feedback everybody - I think this is ready for review!

Do you have in mind to add serialization capabilities as well?

I've gone and done this and made some other substantial changes, error handling, and docs. It's a big diff, but a lot of it is tests and docs.

I've updated the description with some caveats. Please review them and let me know what you think.

Especially interesting to me would be if you're willing to integrate this into an actual use case. I'm curious to see what kind of rough edges surface. I'm happy to help with the integration if you want to ping me.

Also, I'd super appreciate if anyone has time to build and review the new docs.

michaelkirk avatar Aug 25 '22 05:08 michaelkirk

The ser. support is awesome! It should address #176 too, right?

closes #176

rmanoka avatar Aug 25 '22 05:08 rmanoka

The ser. support is awesome! It should address https://github.com/georust/geojson/issues/176 too, right?

Unfortunately, it's probably not much of a performance improvement at this point. In fact, it might be worse in some ways.

For one, there's this really unfortunate PERF note here: https://github.com/georust/geojson/pull/199/files#diff-02292e8d776b8c5c924b5e32f227e772514cb68b37f3f7b384f02cdb6717a181R319

...where, in order to "move" a structs fields into the output's "properties" field, we roundtrip each of your structs to a serde_json::Object.

We do this round trip with only one feature at a time, not the entire feature collection at once, so its memory impact should be limited, but still, it's pretty gross. With the time I had, I wasn't able to come up with a better solution.

But the bigger problem might be that (currently anyway) serialization requires a &[T: Deserialize], so you already need your entire slice of structs to be in memory.

michaelkirk avatar Aug 25 '22 05:08 michaelkirk

The conversion to JsonObject at Feature level is still okay. Currently, our serialize impl.:

impl Serialize for FeatureCollection {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        JsonObject::from(self).serialize(serializer)
    }
}

converts the whole feature collection into a much larger representation before writing. Typically the user is already storing the representation in memory, and the conversion takes a lot (~10x) more memory. This PR nicely side-steps this issue, right?

rmanoka avatar Aug 25 '22 05:08 rmanoka

Could we get a short doc string here so it's visible in the docs?

I've done this in https://github.com/georust/geojson/pull/199/commits/493463eb9bea7856ae4f04656cdfa06c228956bb

And then I realized that the short doc string was basically indistinguishable from FeatureIterator. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator in https://github.com/georust/geojson/pull/199/commits/269e29d2bdde946ebf607016ec58d1260dc7e42e, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features.

The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.

So to recap the new suggested usage:

FeatureReader::features() gives you an iterator over Features, just like FeatureIterator::new(io_stream) does today.

And then:

FeatureReader::.deserialize() gives you an iterator over your custom serde struct.

This is somewhat inspired by the approach taken by csv::Reader::records() vs. csv::Reader::deserialize()

WDYT?

michaelkirk avatar Aug 25 '22 22:08 michaelkirk

should we call out this functionality early on in the "main" docs?

Done in https://github.com/georust/geojson/pull/199/commits/413b959add564933ef7e1ee76da1a1ed3836834b

michaelkirk avatar Aug 25 '22 22:08 michaelkirk

Could we get a short doc string here so it's visible in the docs?

I've done this in 493463e

And then I realized that the short doc string was basically indistinguishable from FeatureIterator. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator in 269e29d, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features.

The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.

So to recap the new suggested usage:

FeatureReader::features() gives you an iterator over Features, just like FeatureIterator::new(io_stream) does today.

And then:

FeatureReader::.deserialize() gives you an iterator over your custom serde struct.

This is somewhat inspired by the approach taken by csv::Reader::records() vs. csv::Reader::deserialize()

WDYT?

Yep, this seems clear and sensible to me.

urschrei avatar Aug 25 '22 23:08 urschrei

I pushed up some examples/* as documentation and to facilitate memory profiling. I've gotten a couple approvals and only pushed up some non-controversial(I think) followups, so I'll plan to merge in a couple days unless I hear otherwise.

Thanks for the reviews!

Some followup commentary:

It should address https://github.com/georust/geojson/issues/176 too, right?

before: Screen Shot 2022-08-26 at 11 50 56 AM

The left hump is the deserialization, the right hump is the serialization.

after: Screen Shot 2022-08-26 at 11 50 20 AM

Note that there is still a big hump for deserialization. Likely due to my other PERF note: https://github.com/georust/geojson/pull/199/files#diff-a9463680bdf3fa7278b52b437bfbe9072e20023a015621ed23bcb589f6ccd4b5R155

It's not any worse than before, so I'm inclined to merge as-is.

One follow-up solution might be to do something like we do in FeatureIterator, where we don't parse the FeatureCollection, rather we "fast forward" until we hit the "features" property and start parsing there. The current FeatureIterator lacks some robustness (see https://github.com/georust/geojson/pull/200#issuecomment-1208318643) , so I'd rather address that separately than try to fit it into this PR.

michaelkirk avatar Aug 26 '22 19:08 michaelkirk

bors r=urschrei,rmanoka

michaelkirk avatar Aug 29 '22 03:08 michaelkirk

Build succeeded:

bors[bot] avatar Aug 29 '22 03:08 bors[bot]