Geometric lookups and "vector data cubes"
It's super easy to create dimension axes that have vector lookups:
julia> using CairoMakie, DimensionalData
julia> points = Point2f.(1, 1:10)
10-element Vector{Point{2, Float32}}:
[1.0, 1.0]
[1.0, 2.0]
[1.0, 3.0]
[1.0, 4.0]
[1.0, 5.0]
[1.0, 6.0]
[1.0, 7.0]
[1.0, 8.0]
[1.0, 9.0]
[1.0, 10.0]
julia> rand(X(DimensionalData.Categorical(points)))
╭────────────────────────────────╮
│ 10-element DimArray{Float64,1} │
├────────────────────────────────┴─────────────────────────────────────── dims ┐
↓ X Categorical{Point{2, Float32}} [Float32[1.0, 1.0], Float32[1.0, 2.0], …, Float32[1.0, 9.0], Float32[1.0, 10.0]] ForwardOrdered
└──────────────────────────────────────────────────────────────────────────────┘
Float32[1.0, 1.0] 0.122667
Float32[1.0, 2.0] 0.670468
Float32[1.0, 3.0] 0.398075
Float32[1.0, 4.0] 0.275494
Float32[1.0, 5.0] 0.437735
Float32[1.0, 6.0] 0.333305
Float32[1.0, 7.0] 0.43346
Float32[1.0, 8.0] 0.740607
Float32[1.0, 9.0] 0.48268
Float32[1.0, 10.0] 0.221809
but some questions arise here.
- How do we plot them? For scatter I could see just using specapi to forward
coloras the values of the dimvector. What about 2d arrays? - For e.g. weather station data, you would have a time axis and a position or geometry axis. How would people use data from here? What are common access patterns?
- Can we make selectors like
Contains,Touches,..,At, etc work, potentially using GeometryOps? Do we want an extension for this? I think this will involve geometric predicates at least. - Do we want to construct spatial indices for such geometric lookups?
- Can we set
GI.geometrycolumnsof a raster to point to geometric lookups? Do we want to stash this in the metadata?
Dimensional data.jl already handles point lookups a bit, see mergedims. But polygons etc need more work. Probably we make a GeometryLookups.jl package so everyone can use them?
scatterdefault is good for points with color as the values. if there is another dimension like time we plot points in 3d? and the same for lines/polygons etc- This already works for points with
MergedLookup, we can use that syntax for everything else too - See
merged.jlfor the point implementation, for polygons we would need special lookups. - That could be better than what
MergedLookupdoes currently. Its slow. - Yep we can do that. One hitch is that geometry can be either values in the array or in a lookup, but we can detect that. (a confusing part of the "vector data cube" talk is people use and discuss both but theyre totally different things)
I just realized that geometric lookups are a great way to represent discretized global grids, since each grid cell is just an integer - a DGGSLookup could be a subtype of some AbstractGeometricLookup, and with a nice API we could have super easy plotting / subsetting / rasterization to a Cartesian grid!
For sure. Except for the nesting, and there should be optimisations for finding polygons on these grids?
If you are interested in DGGS there is also https://github.com/danlooo/DGGS.jl
Thanks Felix, that seems interesting! Are those actually DimArrays though, or just convenient wrapper structs? I went over the code briefly but am not able to see how this is working...
I just wrote up a more interesting vector data cube example with polygons. See the following:
using DimensionalData
using DimensionalData.Lookups
using DimensionalData.Dimensions
import DimensionalData as DD
# load data
using NaturalEarth
import GeoInterface as GI, GeometryOps as GO
fc = naturalearth("admin_0_countries", 10)
geometries_data = fc.geometry
id_data = fc.ADM0_A3
pop_data = fc.POP_EST .|> Float64
# now for the fun part
# let's define a dimension named Geometry
# TODO: how can we make this be interpreted as (X, Y)? can we somehow hook into the same things that MergeDims does?
@dim Geometry
# create a categorical lookup from the geometry
geometry_lookup = Categorical(geometries_data)
# create dimvectors which have geometry lookups - simple so far
id_dd = DimVector(id_data, Geometry(geometry_lookup))
pop_dd = DimVector(pop_data, Geometry(geometry_lookup))
# stack them so they share dims
fc_stack = DimStack((; id = id_dd, pop = pop_dd))
# Now for the _really_ fun part: mixing dimensions
# create a 2d dimarray, that hypothetically has height measurements over time for each geometry
height_dim_matrix = rand(Geometry(geometry_lookup), Ti(1:10))
# stack 'em!
full_stack = DimStack((; id = id_dd, pop = pop_dd, height = height_dim_matrix))
# select the USA
full_stack[Geometry(Where(x -> GO.contains(x, (-103, 44))))]
full_stack[Geometry(Where(x -> GO.contains(x, (-103, 44))))].id
full_stack[Geometry(Where(x -> GO.contains(x, (-103, 44))))].pop
full_stack[Geometry(Where(x -> GO.contains(x, (-103, 44))))].height
this is pretty sweet I think...now just have to figure out how to actually write this in a CF compliant way, and how to detect these lookups when plotting 😢
Just using Where is funny, like that just works as is
I guess we can accelerate that for GO predicates with an Extent index, if we use a Fix2
yep! I'm building out a PolygonLookup now that would support that, but we could change that to be a more generic GeometricLookup at some point that works for any geometries
btw the big issue here is probably going to be file I/O. Any suggestions on that? NetCDF has some support for this but I have no clue how we would get this out from a nc file:
example.nc.txt (remove the .txt after, that's for github sake). This file has polygons with holes, if we can handle that I figure we're pretty much done haha
Have you looked at the python/R vector data cubes ? If we can use something already being standardised...
(The CF standard is likely pretty arcane)
This is really exciting!
And making this a DataFrame almost works, except height vectors are concatenated and scalars atrtibutes are duplicated :
DataFrame(full_stack)
2580×5 DataFrame
Row │ Geometry Ti id pop height
│ Abstract… Int64 String Float64 Float64
──────┼─────────────────────────────────────────────────────────────
1 │ 2D MultiPolygon 1 IDN 2.70626e8 0.379892
2 │ 2D MultiPolygon 1 MYS 3.19498e7 0.851437
3 │ 2D MultiPolygon 1 CHL 1.8952e7 0.0670016
4 │ 2D Polygon 1 BOL 1.15131e7 0.925698
5 │ 2D MultiPolygon 1 PER 3.25105e7 0.699816
Yeah...Ideally we would want a sliced dimtable from the stack, with a given index column
Have you looked at the python/R vector data cubes ? If we can use something already being standardised...
(The CF standard is likely pretty arcane)
They also use CF :((
Ok guess we will have to implement it...
For the IO part, there might be some inspiration in https://github.com/xarray-contrib/xvec/issues/26
Lat-lon-Cap grid used in NASA reanalyses and MITgcm : https://juliaclimate.github.io/MeshArrays.jl/dev/tutorials/geography.html
DD._dims2indices is where we would implement an Unaligned like thing that would let us harvest X and Y from getindex and move it into geometry lookups
There should always be some selector for X and Y, i.e X(10) should never work with a geometric lookup unless there is another X dim
Have you looked at the python/R vector data cubes ? If we can use something already being standardised...
(The CF standard is likely pretty arcane)
They also use CF :((
We are now working on non-CF encoding that supports geometry as variable, not only dimension coordinates. The CF encoding is indeed a bit inconvenient but at least it is an existing standard. Most of that lives on my branch so far but here's the notebook illustrating the idea https://github.com/martinfleis/xvec/blob/summary/doc/source/io.ipynb.
Thanks @martinfleis that's a great write-up
@asinghvi17 I just realised the nested discrete global grid can be represented with a DimTree and geometry lookups. I think it will already work if we merged the branches.
Thanks for the write up, that clarifies a lot!
@rafaqz we can indeed do a dimtree, but there is one issue - we may need to implement s2 indexing 😅 or some other accelerate, at least for either DGGS or known spherical/ellipspodal geometries.
The generic geometry lookup we have should work, but slowly?
But yes we need to make it abstract and specialise a few fast cases like hexagons as @meggart mentioned they are doing in DGGS
@danlooo
Regarding dimtrees, note that a cell may have multiple parents in some DGGS, e.g. here Related to https://github.com/danlooo/DGGS.jl/issues/133
I think that's fine, the tree isn't spatially nested explicitly like that (although it seems there could be some nice optimisations where that is possible)
It just allows layers with different lookups for the same dimensions, that could at the same time share e.g. the time dimension from lower in the tree.
So selecting tree[X=a .. b] will select across all branches even though X is not the same resolution.
The extent will be the union over all branches, so they don't have to match at the boundaries either. Branches could equally represent separate raster tiles of the same resolution.
The only potential hiccup here is that DGGS are, at this point, more like matrix (actually 3d array) lookups than like a single vector flat geometry lookup.
So it would need a custom ArrayLookup and Nearest neighbour searches or something for At/Near etc?
We could just tweak ArrayLookup ? Or do I misunderstand
@asinghvi17 Both are possible, even on the same grid, e.g. Uber H3 normal 1D bit index vs 3D IJK index. Hierarchical indices are usually 1D, having the parent cell as a prefix, optimized for aggregation queries. However, when it comes to distant neighbour queries, e.g. bouding boxes for viewing and convolutions, one wants to have multiple spatial DGGS dimensions more like a matrix.