geojson
geojson copied to clipboard
Deserialize geojson into custom struct using serde
- [x] I agree to follow the project's code of conduct.
- [x] I added an entry to
CHANGES.md
if knowledge of this change could be valuable to users.
This is an attempt to improve the ergonomics of parsing GeoJson using serde. The API is somewhat inspired by my usage of the csv
crate
This PR is a draft because there are a lot of error paths related to invalid parsing that I'd like to add tests for, but I first wanted to check in on overall direction of the API. What do people think?
Here's what it looks like...
Given some geojson like this:
let feature_collection_string = r#"{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [125.6, 10.1]
},
"properties": {
"name": "Dinagat Islands",
"age": 123
}
},
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [2.3, 4.5]
},
"properties": {
"name": "Neverland",
"age": 456
}
}
]
}"#
.as_bytes();
let io_reader = std::io::BufReader::new(feature_collection_string);
You used to parse it like this:
use geojson:: FeatureIterator;
let feature_reader = FeatureIterator::new(io_reader);
for feature in feature_reader.features() {
let feature = feature.expect("valid geojson feature");
let name = feature.property("name").unwrap().as_str().unwrap();
let age = feature.property("age").unwrap().as_u64().unwrap();
let geometry = feature.geometry.value.try_into().unwrap();
if name == "Dinagat Islands" {
assert_eq!(123, age);
assert_matches!(geometry, geo_types::Point::new(125.6, 10.1).into());
} else if name == "Neverland" {
assert_eq!(456, age);
assert_matches!(geometry, geo_types::Point::new(2.3, 4.5).into());
} else {
panic!("unexpected name: {}", name);
}
}
But now you also have the option of parsing it into your own custom struct using serde like this:
use geojson::deserialize::deserialize_geometry;
use geojson::FeatureReader;
use serde::Deserialize;
#[derive(Debug, Deserialize)]
struct MyStruct {
// You can parse directly to geo_types if you want, otherwise this will be a `geojson::Geometry`
#[serde(deserialize_with = "deserialize_geometry")]
geometry: geo_types::Geometry<f64>,
name: String,
age: u64,
}
let feature_reader = FeatureReader::from_reader(io_reader);
for feature in feature_reader.deserialize::<MyStruct>().unwrap() {
let my_struct = feature.expect("valid geojson feature");
if my_struct.name == "Dinagat Islands" {
assert_eq!(my_struct.age, 123);
assert_matches!(my_struct.geometry, geo_types::Point::new(125.6, 10.1).into());
} else if my_struct.name == "Neverland" {
assert_eq!(my_struct.age, 456);
assert_matches!(my_struct.geometry, geo_types::Point::new(2.3, 4.5).into());
} else {
panic!("unexpected name: {}", my_struct.name);
}
}
Related to https://github.com/georust/geojson/issues/184
I like this a lot! The only possible feature I can think of at the moment is the ability to have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time (I could see that being tricky to implement).
I would have definitely liked something like this when I was doing geospatial work last summer.
have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time
That's a great idea. I think I can add that pretty easily actually. Thanks for looking!
have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time
I've pushed up another commit that does just this.
I like the idea a lot. Do you have in mind to add serialization capabilities as well? I was wondering how a roundtrip would look like in regards to foreign members. Could they be preserved with the current approach to potentially achieve a lossless roundtrip?
Thanks for looking @b4l!
Could [foreign members] be preserved with the current approach to potentially achieve a lossless roundtrip?
Not with the current approach. An example of the current behavior to highlight this:
{
"type": "Feature",
"foo": "bar", // <-- foreign member "foo"
properties: {
"foo": 123 // <-- property "foo"
}
}
The approach in this PR flattens properties
into the top level of the struct, and simply ignores any foreign members:
MyStruct {
geometry: Geometry,
foo: u64
}
I can see why this behavior might not work for some use cases. The use case driving this PR is to have processing code that is agnostic (or requires only minimal changes) between formats. e.g. I'd like to have code that can read it's input from either GeoJSON or CSV (and some day maybe other formats as well) while only changing a couple lines of code.
Having multiple sources of inputs requires a sort of "lowest common denominator" of what can be accounted for, and foreign members fall outside of that.
If you did want to keep all the structural information, like foreign members, you can still do that same as always. You have to deserialize to a geojson::Feature
or maybe even just a json::Object
depending on your use cases.
It's also possible that if we threw enough proc_macro at the problem that we could come up with something more powerful, but I don't plan to work on that just yet.
Do you have in mind to add serialization capabilities as well?
Oh sorry I forgot to answer this in my original reply. If this seems like a good overall direction, I'm happy to add serialization.
Thanks for the explanations @michaelkirk !
My two potential use cases I have in mind are enforcing a certain schema with a concrete type/struct and support for https://github.com/opengeospatial/ogc-feat-geo-json.
This PR gives us the ability to enforce a concrete geometry type and a certain properties struct. I don't really know how important my use cases are, just wanted to mention them to raise awareness.
This feature would be brilliant!
This is indeed a fantastic idea: really flexible and ergonomic conversion to/from geojson! Thanks @michaelkirk
Thanks for the feedback everybody - I think this is ready for review!
Do you have in mind to add serialization capabilities as well?
I've gone and done this and made some other substantial changes, error handling, and docs. It's a big diff, but a lot of it is tests and docs.
I've updated the description with some caveats. Please review them and let me know what you think.
Especially interesting to me would be if you're willing to integrate this into an actual use case. I'm curious to see what kind of rough edges surface. I'm happy to help with the integration if you want to ping me.
Also, I'd super appreciate if anyone has time to build and review the new docs.
The ser. support is awesome! It should address #176 too, right?
closes #176
The ser. support is awesome! It should address https://github.com/georust/geojson/issues/176 too, right?
Unfortunately, it's probably not much of a performance improvement at this point. In fact, it might be worse in some ways.
For one, there's this really unfortunate PERF
note here: https://github.com/georust/geojson/pull/199/files#diff-02292e8d776b8c5c924b5e32f227e772514cb68b37f3f7b384f02cdb6717a181R319
...where, in order to "move" a structs fields into the output's "properties" field, we roundtrip each of your structs to a serde_json::Object
.
We do this round trip with only one feature at a time, not the entire feature collection at once, so its memory impact should be limited, but still, it's pretty gross. With the time I had, I wasn't able to come up with a better solution.
But the bigger problem might be that (currently anyway) serialization requires a &[T: Deserialize]
, so you already need your entire slice of structs to be in memory.
The conversion to JsonObject
at Feature
level is still okay. Currently, our serialize impl.:
impl Serialize for FeatureCollection {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: Serializer,
{
JsonObject::from(self).serialize(serializer)
}
}
converts the whole feature collection into a much larger representation before writing. Typically the user is already storing the representation in memory, and the conversion takes a lot (~10x) more memory. This PR nicely side-steps this issue, right?
Could we get a short doc string here so it's visible in the docs?
I've done this in https://github.com/georust/geojson/pull/199/commits/493463eb9bea7856ae4f04656cdfa06c228956bb
And then I realized that the short doc string was basically indistinguishable from FeatureIterator
. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator
in https://github.com/georust/geojson/pull/199/commits/269e29d2bdde946ebf607016ec58d1260dc7e42e, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features
.
The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.
So to recap the new suggested usage:
FeatureReader::features()
gives you an iterator over Feature
s, just like FeatureIterator::new(io_stream)
does today.
And then:
FeatureReader::.deserialize()
gives you an iterator over your custom serde struct.
This is somewhat inspired by the approach taken by csv::Reader::records()
vs. csv::Reader::deserialize()
WDYT?
should we call out this functionality early on in the "main" docs?
Done in https://github.com/georust/geojson/pull/199/commits/413b959add564933ef7e1ee76da1a1ed3836834b
Could we get a short doc string here so it's visible in the docs?
I've done this in 493463e
And then I realized that the short doc string was basically indistinguishable from
FeatureIterator
. To avoid the confusion to users of having similar sounding things, I decided to deprecate the publicFeatureIterator
in 269e29d, in favor of accessing it opaquely (as an impl Iterator) onFeatureReader::features
.The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.
So to recap the new suggested usage:
FeatureReader::features()
gives you an iterator overFeature
s, just likeFeatureIterator::new(io_stream)
does today.And then:
FeatureReader::.deserialize()
gives you an iterator over your custom serde struct.This is somewhat inspired by the approach taken by
csv::Reader::records()
vs.csv::Reader::deserialize()
WDYT?
Yep, this seems clear and sensible to me.
I pushed up some examples/* as documentation and to facilitate memory profiling. I've gotten a couple approvals and only pushed up some non-controversial(I think) followups, so I'll plan to merge in a couple days unless I hear otherwise.
Thanks for the reviews!
Some followup commentary:
It should address https://github.com/georust/geojson/issues/176 too, right?
before:
The left hump is the deserialization, the right hump is the serialization.
after:
Note that there is still a big hump for deserialization. Likely due to my other PERF note: https://github.com/georust/geojson/pull/199/files#diff-a9463680bdf3fa7278b52b437bfbe9072e20023a015621ed23bcb589f6ccd4b5R155
It's not any worse than before, so I'm inclined to merge as-is.
One follow-up solution might be to do something like we do in FeatureIterator
, where we don't parse the FeatureCollection, rather we "fast forward" until we hit the "features"
property and start parsing there. The current FeatureIterator lacks some robustness (see https://github.com/georust/geojson/pull/200#issuecomment-1208318643) , so I'd rather address that separately than try to fit it into this PR.
bors r=urschrei,rmanoka