Use arrow-csv
To get more familiar with things around here, I'm taking a stab at #613. Opening up this draft PR to ask two questions:
- Is #613 still useful? Don't want to work something that's OBE
- If #613 is still useful, I'd appreciate some guidance about the correct use of
MixedGeometryArray::into_arrow. Specifically,into_arrowproduces a "minimal" union (only the types actually present in the mixed array) whileChunkedMixedGeometryArray::data_typeproduces a superset union of all possible types. I've provided some test output below for more info.
Related issues
- Closes #613
More information
The code in the PR works, but feels wrong to me. The "correct" code (to my naive eyes) produces schema mismatch.
Code and test output
table.set_column(
geometry_column_index,
geometries
.data_type()
.to_field(geometry_field.name(), geometry_field.is_nullable())
.into(),
geometries.array_refs(),
)?;
Test output:
test io::csv::reader::tests::read ... FAILED
failures:
---- io::csv::reader::tests::read stdout ----
thread 'io::csv::reader::tests::read' panicked at src/io/csv/reader.rs:127:72:
called `Result::unwrap()` on an `Err` value: Arrow(InvalidArgumentError("column types must match schema types, expected Union([(1, Field { name: \"\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.point\"} }), (2, Field { name: \"\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.linestring\"} }), (3, Field { name: \"\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.polygon\"} }), (4, Field { name: \"\", data_type: List(Field { name: \"points\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipoint\"} }), (5, Field { name: \"\", data_type: List(Field { name: \"linestrings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multilinestring\"} }), (6, Field { name: \"\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) but found Union([(1, Field { name: \"geometry\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:metadata\": \"{\\\"crs\\\":null,\\\"edges\\\":null}\", \"ARROW:extension:name\": \"geoarrow.point\"} })], Dense) at column index 3"))
- Is Use arrow-csv for CSV operations #613 still useful? Don't want to work something that's OBE
Yes! I think in general it's better to use arrow tools where possible, and add the spatial part on top, rather than relying on a row-based geozero and having to do the conversion to Arrow ourselves.
Specifically,
into_arrowproduces a "minimal" union (only the types actually present in the mixed array) whileChunkedMixedGeometryArray::data_typeproduces a superset union of all possible types. I've provided some test output below for more info.
Yeah... we have some known issues around the union types. E.g. I got stuck in https://github.com/geoarrow/geoarrow-rs/pull/498 around something quite similar.
I should check with @paleolimbot and @jorisvandenbossche how we want to handle union types canonically. Should we expect that GeoArrow Geometry arrays have all fields or just the minimal fields? I added a comment to https://github.com/geoarrow/geoarrow/pull/43#issuecomment-2139155928 about this.
@kylebarron any movement on the union type issue? I ask because I'm running into this issue w/ parquet writing ATM. It looks like you have some skeleton code here: https://github.com/geoarrow/geoarrow-rs/blob/ccc636c8b2a07a399fa9353792712f27d142b8d9/src/io/parquet/writer/metadata.rs#L244-L266
Do you reckon that's valid, at least as a guess on where things will settle?
It looks like you have some skeleton code here:
I don't think that itself is a relevant part of the code. You're probably hitting union type issues somewhere else. Do you have a specific traceback?
Do you have a specific traceback?
Yeah:
called `Result::unwrap()` on an `Err` value: GeoArrow(Arrow(CastError("Invalid argument error: column types must match schema types, expected Union([(1, Field { name: \"\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.point\"} }), (2, Field { name: \"\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.linestring\"} }), (3, Field { name: \"\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.polygon\"} }), (4, Field { name: \"\", data_type: List(Field { name: \"points\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipoint\"} }), (5, Field { name: \"\", data_type: List(Field { name: \"linestrings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multilinestring\"} }), (6, Field { name: \"\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) but found Union([(6, Field { name: \"geometry\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) at column index 8")))
This is happening on a simple roundtrip test in stac-geoparquet (write a mixed geometry table to some bytes, then read it back).
But can you remind me what line of code that's coming from? It's not in the writer code you linked to
https://github.com/geoarrow/geoarrow-rs/pull/714 that'll be a big PR to work on next week.
Sweet ... guessing you've got it, but if need be a minimum reproducible example is in a draft PR here: https://github.com/geoarrow/geoarrow-rs/pull/717.
@gadomski can you try again from latest main?
@kylebarron looks like it's fixed for me over at stac-geoparquet! I'll see if I can update this PR as well 🙇🏼
Closing in favor of https://github.com/geoarrow/geoarrow-rs/pull/826