
H3 values modified when exported as JSON but not as CSV

marklit opened this issue 2 years ago • 5 comments

I've compiled the H3 extension for DuckDB and loaded it into the DuckDB binary built with your (spatial) extension.

$ git clone https://github.com/isaacbrodsky/h3-duckdb ./duckdb_h3
$ cd ./duckdb_h3
$ git submodule update --init

$ GEN=ninja \
    make release
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb -unsigned aus1.duckdb
install '/Volumes/Seagate/duckdb_h3/build/release/h3.duckdb_extension';
load h3;
load json;

I'm able to pull an example record where the H3_7 value starts with 87be.

.mode line

SELECT geom,
       st_centroid(geom) centroid,
       st_x(st_centroid(geom)) x,
       st_y(st_centroid(geom)) y,
       printf('%x', h3_latlng_to_cell(
                        ST_Y(ST_CENTROID(geom)),
                        ST_X(ST_CENTROID(geom)),
                        7)::bigint) h3_7_yx
from roads2
limit 1;
    geom = LINESTRING (147.084939479828 -30.7773677126431, 147.0849609375 -30.7773584948334, 147.084971666336 -30.7773400592114)
centroid = POINT (147.08495789089685 -30.777356503454)
       x = 147.08495789089685
       y = -30.777356503454
 h3_7_yx = 87beee8a3ffffff

87beee8a3ffffff is in Australia as I expected. https://what-the-h3index.vercel.app/
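As a sanity check that doesn't rely on any viewer, the mode and resolution can be read straight out of the H3 index bit layout (per the H3 spec: bits 59–62 are the index mode, 1 for a cell; bits 52–55 are the resolution). A minimal stdlib-only sketch (the helper names here are made up for illustration):

```python
# Decode fields from an H3 cell index using the published bit layout:
# bit 63 reserved, bits 59-62 mode (1 = cell), bits 52-55 resolution.
def h3_mode(index: int) -> int:
    return (index >> 59) & 0xF

def h3_resolution(index: int) -> int:
    return (index >> 52) & 0xF

good = 0x87beee8a3ffffff   # the value seen interactively above

assert h3_mode(good) == 1        # a valid cell index
assert h3_resolution(good) == 7  # matches the requested resolution
```

The same check applied to the exported values would at least confirm they are structurally valid resolution-7 cells, narrowing the problem to the coordinates rather than the encoding.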

I then exported that table with the aggregate counts for each H3_7.

COPY (
    SELECT printf('%x',
                  h3_latlng_to_cell(
                      ST_Y(ST_CENTROID(geom)),
                      ST_X(ST_CENTROID(geom)),
                      7)::bigint) as h3_7,
           count(*) count
    FROM  roads2
    GROUP BY 1
) TO 'aus_h3.json';

None of the exported records start with 87be. For some reason, only H3_7s prefixed with 87[9ac] are exported, and none of these are anywhere near Australia.

$ jq -S . aus_h3.json | grep h3 | cut -c1-15 | sort | uniq -c
 151   "h3_7": "8798
 469   "h3_7": "8799
 841   "h3_7": "87a4
 144   "h3_7": "87a5
 520   "h3_7": "87c4
 357   "h3_7": "87c5
$ head aus_h3.json
{"h3_7":"87c529c10ffffff","count":6287}
{"h3_7":"87a4c5d6effffff","count":2441}
{"h3_7":"87c431431ffffff","count":19481}
{"h3_7":"87a4c0860ffffff","count":12708}
{"h3_7":"87a4de11cffffff","count":2375}
{"h3_7":"87c519272ffffff","count":4185}
{"h3_7":"87c52b8c4ffffff","count":5311}
{"h3_7":"87a4de95effffff","count":1665}
{"h3_7":"87c433b98ffffff","count":10999}
{"h3_7":"87c52b80dffffff","count":4654}
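The jq pipeline above can also be reproduced with the Python stdlib; a small sketch that tallies exported values by their leading hex digits, assuming the newline-delimited JSON layout shown above (the function name is hypothetical):

```python
import json
from collections import Counter

def prefix_counts(lines, key="h3_7", width=4):
    """Count newline-delimited JSON records by the first `width` hex digits of `key`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)[key][:width]] += 1
    return counts

sample = [
    '{"h3_7":"87c529c10ffffff","count":6287}',
    '{"h3_7":"87a4c5d6effffff","count":2441}',
    '{"h3_7":"87c431431ffffff","count":19481}',
]
assert prefix_counts(sample) == {"87c5": 1, "87a4": 1, "87c4": 1}
```

Running it on the real file (`prefix_counts(open('aus_h3.json'))`) should reproduce the `87[9ac]` prefix table above.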

marklit avatar Apr 08 '23 07:04 marklit

Does it work if you export the entire original subquery? The export itself doesn't really have any knowledge of what it's exporting in this case, since there is no geometry, so I'm hesitant to believe the JSON driver is to blame.

Maxxen avatar Apr 08 '23 08:04 Maxxen

I'll walk through it end-to-end and see if I can reproduce it with a single record. Maybe I'm missing something.

marklit avatar Apr 08 '23 09:04 marklit

I ran through the whole piece again and used resolution 2 so there are fewer results to look at. I'm wondering if this is a degrees-to-radians conversion issue.

$ cat aus_h3.json 
{"h3_2":"82c52ffffffffff","count":161}
{"h3_2":"82c437fffffffff","count":137}
{"h3_2":"82c50ffffffffff","count":218}
{"h3_2":"82c4affffffffff","count":32}
{"h3_2":"82a4c7fffffffff","count":108}
{"h3_2":"82c427fffffffff","count":81}
{"h3_2":"82a477fffffffff","count":42}
{"h3_2":"82a4dffffffffff","count":63}
{"h3_2":"82a457fffffffff","count":29}
{"h3_2":"82a467fffffffff","count":43}
{"h3_2":"82a4e7fffffffff","count":26}
{"h3_2":"82999ffffffffff","count":7}
{"h3_2":"8298b7fffffffff","count":15}
{"h3_2":"82a54ffffffffff","count":20}
{"h3_2":"82998ffffffffff","count":10}
{"h3_2":"82c51ffffffffff","count":5}
{"h3_2":"82c48ffffffffff","count":1}
{"h3_2":"82c417fffffffff","count":2}

marklit avatar Apr 08 '23 13:04 marklit

I'm able to get the data out properly using CSV.

$ vi csv_to_json.py
import json
import sys

# Parse the two-column CSV (h3, count) from stdin;
# assumes no quoted fields, which holds for this output.
recs = [x.split(',')
        for x in sys.stdin.read().strip().splitlines()]

# Emit a JSON array of {"count": ..., "hexIds": [...]} objects.
print(json.dumps([{'count': cnt,
                   'hexIds': [h3]}
                  for h3, cnt in recs]))
$ vi h3.sql
.mode csv
LOAD h3;
LOAD json;

SELECT printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(geom)),
                  ST_X(ST_CENTROID(geom)),
                  2)::bigint) as h3_2,
       count(*) count
FROM  roads2
GROUP BY 1;
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
        -unsigned \
        aus.duckdb \
    < h3.sql \
    | python3 csv_to_json.py \
    | tail +2 \
    > aus_h3.duckgl.json

marklit avatar Apr 08 '23 13:04 marklit

Wondering if this is a type-casting problem? Are the JSON values being cast as character, integer, or double? CSV would not be impacted since, unlike JSON, it is not typed. Do you see this issue in Parquet too?
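The double hypothesis is easy to test numerically: an H3 cell index needs up to 60 bits, but an IEEE-754 double has only 53 mantissa bits, so any round-trip through a JSON number type would corrupt the low bits of the index. A quick stdlib check:

```python
good = 0x87beee8a3ffffff  # the resolution-7 cell from the example above

# A double has 53 mantissa bits; this index needs 60, so a round-trip
# through float (as a JSON "number" cast would do) cannot preserve it.
assert int(float(good)) != good

# The damage is confined to the low bits, though: the 87be prefix
# survives, so float rounding alone can't explain 87be becoming 87a4.
assert f"{int(float(good)):x}".startswith("87be")
```

Note that the queries above already cast to VARCHAR via `printf('%x', ...)` before the COPY, so the JSON writer should be seeing strings; still, Parquet would be a useful cross-check, as suggested.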

cboettig avatar Mar 10 '24 21:03 cboettig