Explore creating a geodata API

Open waldoj opened this issue on May 7, 2017 • 13 comments

It would be useful to let people issue queries using lat/lon instead of just ZIP. But I'm only interested in doing so if I can use a very lightweight hosting process, as per the rest of Frostline.

waldoj avatar May 07 '17 23:05 waldoj

Theory: I can use PRISM's geodata to generate static files for lat/lon pairs at a reasonable resolution. Queries would then be issued at a prescribed level of specificity (no more than 0.1 degrees, or about 11 kilometers, e.g. "39.1, 94.6"). By generating a file for each pair (e.g., 39.1,94.6.json), a static API can serve up all of the data.
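
In other words, the lookup is nothing more than rounding the query and building a file name. A rough sketch (zone_filename is just a hypothetical helper; the one-decimal rounding mirrors the 0.1-degree example above):

def zone_filename(lat, lon, precision=1):
    """Round a query coordinate to the prescribed specificity and
    return the name of the static file that would answer it."""
    return "{0:.{p}f},{1:.{p}f}.json".format(lat, lon, p=precision)

# zone_filename(39.1, 94.6)   -> "39.1,94.6.json"
# zone_filename(39.14, 94.62) -> "39.1,94.6.json"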

waldoj avatar May 07 '17 23:05 waldoj

The simplest way to do this is probably via the provided ARC/INFO ASCII grid files. Puerto Rico alone is represented as 557 columns by 170 rows, or 94,690 cells, which would mean 94,690 files for that one territory. (Note that each value is not a hardiness zone, but the minimum temperature in Celsius multiplied by 100; it's necessary to convert that to a PHZ before writing out the data.) This implies rather a large number of files for the entire U.S.
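
The conversion could look something like this (untested sketch; it assumes the standard USDA scheme of 13 zones in 10°F bands starting at -60°F, each split into "a"/"b" halves, it treats -9999 as NODATA, and boundary values like exactly -20°F would need a decision):

def phz(raw_value):
    """Convert a grid value (minimum temperature in Celsius x 100)
    to a hardiness zone string like "5a". Returns None for NODATA."""
    if raw_value == -9999:
        return None
    temp_f = (raw_value / 100.0) * 9.0 / 5.0 + 32.0
    # Zone 1 starts at -60 F; each zone spans 10 F, each half-zone 5 F.
    zone = max(1, min(int((temp_f + 60) // 10) + 1, 13))
    half = "a" if (temp_f + 60) % 10 < 5 else "b"
    return "{}{}".format(zone, half)

# phz(-2630) -> "5a"   (-26.3 C is about -15.3 F)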

waldoj avatar May 07 '17 23:05 waldoj

Of course, it remains to map each cell to a physical location, but that looks easy. Each ASC file opens with metadata like this:

xllcorner -67.31875000000
yllcorner 17.86875000000
cellsize 0.00416666667

So we simply add 0.00416666667 to -67.31875 as we advance through each column. The rows in the file run north to south, while yllcorner marks the southern edge, so latitude starts at 17.86875 plus (number of rows × 0.00416666667) and we subtract 0.00416666667 as we advance through each row.
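
Putting the two together, the cell-to-coordinate mapping might look roughly like this (untested sketch; it assumes the usual six-line ESRI header with ncols, nrows, and NODATA_value, and it yields cell centers rather than corners):

def read_asc(path):
    """Yield (lat, lon, value) for every cell in an ESRI ASCII grid."""
    with open(path) as f:
        header = {}
        for _ in range(6):  # ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value
            key, value = f.readline().split()
            header[key.lower()] = float(value)
        cell = header["cellsize"]
        nrows = int(header["nrows"])
        for row, line in enumerate(f):
            # The first data row is the northernmost one.
            lat = header["yllcorner"] + (nrows - row - 0.5) * cell
            for col, value in enumerate(line.split()):
                lon = header["xllcorner"] + (col + 0.5) * cell
                yield lat, lon, int(value)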

waldoj avatar May 07 '17 23:05 waldoj

One concern that I have is about the resolution that this yields. Is it reasonably round-able? Will there be collisions? How to handle them?

waldoj avatar May 07 '17 23:05 waldoj

The continental U.S. is 7,025 columns by 3,105 rows, or 21,812,625 files. That certainly is a very large number of files. Hawaii is another 1,077,008 records and Alaska another 808,505 (an oddly small number), which, with Puerto Rico's 94,690, makes a total of 23,792,828 records. Some percentage of these will be blank (that is, have a value of -9999), although the approximate shape of the U.S. means that it will not be an enormous percentage. Perhaps 20%, thanks to Florida and Maine? So that leaves about 19 million records, if no simplification is performed.
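
That 20% is only a guess, but it would be cheap to check with something like this (the file names are placeholders for wherever the four ASC grids live):

files = ["us_conus.asc", "us_hi.asc", "us_ak.asc", "us_pr.asc"]  # placeholders
total = blank = 0
for path in files:
    with open(path) as f:
        for _ in range(6):   # skip the header lines
            f.readline()
        for line in f:
            for value in line.split():
                total += 1
                if value == "-9999":
                    blank += 1
print("{} of {} cells are blank ({:.1%})".format(blank, total, blank / float(total)))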

waldoj avatar May 07 '17 23:05 waldoj

It turns out that these files are at different resolutions. The continental U.S. is at 800-meter resolution, Hawaii and Puerto Rico are at 400-meter resolution, and Alaska is at 4,000-meter resolution. That raises the interesting question of how to apply consistent data-density standards across the board.

The continental U.S. data has a cell size of 0.00833333333, or just under 0.01 degrees. At a resolution of 0.1 degrees, we'd be using every 13th cell, or 1/169th of the entire dataset. That would leave us with a very manageable 104,142 JSON files for the continental U.S. (Whether that is sufficient resolution to accurately capture PHZ is a different question.)
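
The sampling pass itself is trivial; a sketch (the file name is a placeholder, and the step is whatever every-Nth-cell figure we settle on):

def downsample(grid, step):
    """Keep every step-th row and every step-th column of a 2-D grid
    (a list of rows, each row a list of cell values)."""
    return [row[::step] for row in grid[::step]]

# e.g., for the continental file:
#   grid = [line.split() for line in open("us_conus.asc").readlines()[6:]]
#   coarse = downsample(grid, 13)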

waldoj avatar May 07 '17 23:05 waldoj

If we had a resolution of 0.01 degrees, that would leave us with about 14,666,080 records for the continental U.S., or 140 times more than at 0.1 degrees of resolution.

waldoj avatar May 08 '17 00:05 waldoj

Each of the 26 half-zones (13 zones, each split into an "a" and a "b") represents a spread of 5°F. So these zones are not particularly fine-grained. Spot-checking some rows from the data, I feel good that a resolution of 0.1 degrees is adequate.

However, aggregating lat/lon pairs is inherently going to result in some inaccuracies for places on the bubble. For instance, these two stanzas

-18 -18 -18 -19 -19 -19 -19 -19 -20 -20 -20 -20 -20
-20 -20 -19 -19 -19 -19 -19 -18 -18 -19 -20 -20 -20

would be reduced to -18 and -20 (assuming we sampled the first entry), or 5a and 4b, respectively. But 5 of the entries in the first stanza are actually 4b, and 8 of the entries in the second stanza are actually 5a. Averaging doesn't help this problem: in this instance, the average of both stanzas is -19, which would put every one of these places in zone 5a, even though 10 of the 26 are in zone 4b.

Basically, the question here is what level of accuracy is acceptable. Perhaps it's worth a reduction in accuracy to reduce the number of files by 99.5%. But is it worth any reduction in accuracy to cut out just 17% of records?

waldoj avatar May 08 '17 00:05 waldoj

The next thing to do is some benchmarking: figure out how much data we're talking about here, and how long it will take to generate those files.

waldoj avatar May 08 '17 01:05 waldoj

If we use the same file format as we do for the ZIPs (unnecessarily repeating the lat/lon pair within the JSON), that's 92 bytes per file, or 1.7 GB of data for every data point in the entire U.S. (assuming, as always, that 20% of data points in the ASC files have no value). That's not bad. If we used a resolution of 0.1 degrees, that would be a mere 8 MB in total.

If we eliminate the repeated lat/lon pair, that brings the total size down to 831 MB.
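
For the record, the arithmetic behind those estimates (the record count is the rough 19-million figure from above):

records = 19000000        # ~80% of the 23.8M native-resolution cells
with_latlon = 92          # bytes per file with the lat/lon pair repeated
without_latlon = 44       # rough bytes per file with the pair dropped

print(records * with_latlon / 1e9)     # ~1.7 GB
print(records * without_latlon / 1e9)  # ~0.84 GB, i.e. the ~831 MB figure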

waldoj avatar May 08 '17 01:05 waldoj

An advantage of using the native file resolutions is that it largely lets us sidestep the question of reconciling the three different resolutions. But it doesn't help with the question of what degree of precision to prescribe for queries.

Because the native continental resolution is just under 0.01 degrees, we would of course need to use three decimal digits. That's problematic, though, because we wind up with vast swaths of blank namespace (e.g., we have a record for "39.101, 94.655" but nothing for the requested "39.102, 94.655"). This indicates that, again just for the continental U.S., we need to round to two decimal digits. We are inherently going to wind up with a certain degree of inaccuracy, but this 13% reduction means that no more than 13% of records have a chance of becoming inaccurate.

Alaska presents an awkward arrangement, with its 4,000-meter resolution. We're going to have to either accept less granularity on Alaska queries (which is bad) or use something like a nearest-neighbor lookup to fake it.

Hawaii and Puerto Rico's 400-meter resolution will necessitate rather more averaging, so we'll have more edge cases there.
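
The nearest-neighbor approach could be as simple as snapping a query onto the lattice of points we actually generated (sketch only; origin_lat and origin_lon are the coordinates of one generated cell center, and cellsize is that grid's spacing):

def nearest_cell(lat, lon, origin_lat, origin_lon, cellsize):
    """Snap an arbitrary query coordinate onto the lattice of generated
    cell centers, so a coarse grid (like Alaska's) can answer any query."""
    def snap(value, origin):
        return origin + round((value - origin) / cellsize) * cellsize
    return round(snap(lat, origin_lat), 3), round(snap(lon, origin_lon), 3)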

waldoj avatar May 08 '17 01:05 waldoj

I'm just going to throw this out there...what about symlinks? Millions of symlinks? They're smaller than actual json files and they'd allow you to have whatever granularity you wanted with no gaps.
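
As a sketch of that (assuming the hypothetical lat,lon.json naming from above, with real files at 0.1-degree spacing and symlinks filling in the 0.01-degree points):

import os

def link_fine_to_coarse(lat, lon):
    """Point a 0.01-degree coordinate at the real 0.1-degree file
    that contains it."""
    fine = "{:.2f},{:.2f}.json".format(lat, lon)
    coarse = "{:.1f},{:.1f}.json".format(lat, lon)
    if fine != coarse and not os.path.lexists(fine):
        os.symlink(coarse, fine)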

mlissner avatar May 08 '17 16:05 mlissner

Ooooh, of course! Excellent idea!

waldoj avatar May 08 '17 16:05 waldoj