geoarrow-rs icon indicating copy to clipboard operation
geoarrow-rs copied to clipboard

Use arrow-csv

Open gadomski opened this issue 1 year ago • 10 comments

To get more familiar with things around here, I'm taking a stab at #613. Opening up this draft PR to ask two questions:

  • Is #613 still useful? Don't want to work something that's OBE
  • If #613 is still useful, I'd appreciate some guidance about the correct use of MixedGeometryArray::into_arrow. Specifically, into_arrow produces a "minimal" union (only the types actually present in the mixed array) while ChunkedMixedGeometryArray::data_type produces a superset union of all possible types. I've provided some test output below for more info.

Related issues

  • Closes #613

More information

The code in the PR works, but feels wrong to me. The "correct" code (to my naive eyes) produces schema mismatch.

Code and test output
table.set_column(
    geometry_column_index,
    geometries
        .data_type()
        .to_field(geometry_field.name(), geometry_field.is_nullable())
        .into(),
    geometries.array_refs(),
)?;

Test output:

test io::csv::reader::tests::read ... FAILED

failures:

---- io::csv::reader::tests::read stdout ----
thread 'io::csv::reader::tests::read' panicked at src/io/csv/reader.rs:127:72:
called `Result::unwrap()` on an `Err` value: Arrow(InvalidArgumentError("column types must match schema types, expected Union([(1, Field { name: \"\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.point\"} }), (2, Field { name: \"\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.linestring\"} }), (3, Field { name: \"\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.polygon\"} }), (4, Field { name: \"\", data_type: List(Field { name: \"points\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipoint\"} }), (5, Field { name: \"\", data_type: List(Field { name: \"linestrings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multilinestring\"} }), (6, Field { name: \"\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) but found Union([(1, Field { name: \"geometry\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:metadata\": \"{\\\"crs\\\":null,\\\"edges\\\":null}\", \"ARROW:extension:name\": \"geoarrow.point\"} })], Dense) at column index 3"))

gadomski avatar May 29 '24 20:05 gadomski

Yes! I think in general it's better to use arrow tools where possible, and add the spatial part on top, rather than relying on a row-based geozero and having to do the conversion to Arrow ourselves.

kylebarron avatar May 30 '24 09:05 kylebarron

Specifically, into_arrow produces a "minimal" union (only the types actually present in the mixed array) while ChunkedMixedGeometryArray::data_type produces a superset union of all possible types. I've provided some test output below for more info.

Yeah... we have some known issues around the union types. E.g. I got stuck in https://github.com/geoarrow/geoarrow-rs/pull/498 around something quite similar.

I should check with @paleolimbot and @jorisvandenbossche how we want to handle union types canonically. Should we expect that GeoArrow Geometry arrays have all fields or just the minimal fields? I added a comment to https://github.com/geoarrow/geoarrow/pull/43#issuecomment-2139155928 about this.

kylebarron avatar May 30 '24 09:05 kylebarron

@kylebarron any movement on the union type issue? I ask because I'm running into this issue w/ parquet writing ATM. It looks like you have some skeleton code here: https://github.com/geoarrow/geoarrow-rs/blob/ccc636c8b2a07a399fa9353792712f27d142b8d9/src/io/parquet/writer/metadata.rs#L244-L266

Do you reckon that's valid, at least as a guess on where things will settle?

gadomski avatar Aug 23 '24 22:08 gadomski

It looks like you have some skeleton code here:

I don't think that itself is a relevant part of the code. You're probably hitting union type issues somewhere else. Do you have a specific traceback?

kylebarron avatar Aug 23 '24 22:08 kylebarron

Do you have a specific traceback?

Yeah:

called `Result::unwrap()` on an `Err` value: GeoArrow(Arrow(CastError("Invalid argument error: column types must match schema types, expected Union([(1, Field { name: \"\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.point\"} }), (2, Field { name: \"\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.linestring\"} }), (3, Field { name: \"\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.polygon\"} }), (4, Field { name: \"\", data_type: List(Field { name: \"points\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipoint\"} }), (5, Field { name: \"\", data_type: List(Field { name: \"linestrings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multilinestring\"} }), (6, Field { name: \"\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) but found Union([(6, Field { name: \"geometry\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) at column index 8")))

This is happening on a simple roundtrip test in stac-geoparquet (write a mixed geometry table to some bytes, then read it back).

gadomski avatar Aug 23 '24 22:08 gadomski

But can you remind me what line of code that's coming from? It's not in the writer code you linked to

kylebarron avatar Aug 23 '24 22:08 kylebarron

https://github.com/geoarrow/geoarrow-rs/pull/714 that'll be a big PR to work on next week.

kylebarron avatar Aug 23 '24 22:08 kylebarron

Sweet ... guessing you've got it, but if need be a minimum reproducible example is in a draft PR here: https://github.com/geoarrow/geoarrow-rs/pull/717.

gadomski avatar Aug 26 '24 12:08 gadomski

@gadomski can you try again from latest main?

kylebarron avatar Aug 26 '24 14:08 kylebarron

@kylebarron looks like it's fixed for me over at stac-geoparquet! I'll see if I can update this PR as well 🙇🏼

gadomski avatar Aug 26 '24 15:08 gadomski

Closing in favor of https://github.com/geoarrow/geoarrow-rs/pull/826

gadomski avatar Oct 22 '24 00:10 gadomski