
Data formats for triangle meshes

Open • jbednar opened this issue on Nov 14, 2017 • 7 comments

https://github.com/bokeh/datashader/pull/525 adds rendering for triangle meshes, as long as data is provided in a specific format. Other formats worth discussing include:

Row per triangle (current):

val x0 y0 x1 y1 x2 y2

Pros: Simple, efficient to iterate over, easy to stream triangles as needed, easy to associate a value with each triangle.

Cons: Only allows one value per triangle; hard to interpolate (have to find values for neighboring triangles).
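For concreteness, a minimal sketch of that layout as a Pandas DataFrame (column names are illustrative, not necessarily the ones the PR expects):

import numpy as np
import pandas as pd

# One row per triangle: a single value plus the three corner coordinates
n_tris = 5
df_tri = pd.DataFrame({
    'val': np.random.rand(n_tris),
    'x0': np.random.rand(n_tris), 'y0': np.random.rand(n_tris),
    'x1': np.random.rand(n_tris), 'y1': np.random.rand(n_tris),
    'x2': np.random.rand(n_tris), 'y2': np.random.rand(n_tris),
})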

Row per node:

nodeid x y val1 val2

Pros: Allows arbitrary values to be associated with each node, including new values as needed (as new columns should be cheap to add in Pandas for a fixed set of rows); easy to index into large numbers of nodes.

Cons: Requires a separate mesh structure (e.g. triangleid nodeid1 nodeid2 nodeid3); requires conventions about column names to be able to index across time.
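A minimal sketch of that layout, including the separate connectivity table mentioned in the cons (names illustrative):

import numpy as np
import pandas as pd

n_nodes, n_tris = 100, 180

# One row per node, with as many value columns as needed
nodes = pd.DataFrame({
    'x': np.random.rand(n_nodes),
    'y': np.random.rand(n_nodes),
    'val1': np.random.rand(n_nodes),
    'val2': np.random.rand(n_nodes),
}, index=pd.RangeIndex(n_nodes, name='nodeid'))

# Separate mesh-connectivity table: triangleid -> three node ids
tris = pd.DataFrame(np.random.randint(0, n_nodes, (n_tris, 3)),
                    columns=['nodeid1', 'nodeid2', 'nodeid3'],
                    index=pd.RangeIndex(n_tris, name='triangleid'))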

Row per observation

time node1val node2val node3val node4val ...

Pros: Fits with streaming dataframe approaches, for visualizing long time series generated incrementally.

Cons: That's a lot of columns, and probably not efficient in time or space in Pandas; requires a separate mesh structure (e.g. triangleid nodeid1 nodeid2 nodeid3) and a separate node-location structure (e.g. nodeid x y).
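A minimal sketch of the wide one-row-per-timestep layout (the mesh and node-location tables would be separate, as noted above):

import numpy as np
import pandas as pd

n_nodes, n_times = 100, 50

# One row per observation time, one column per node value
obs = pd.DataFrame(np.random.rand(n_times, n_nodes),
                   columns=['node%dval' % i for i in range(n_nodes)],
                   index=pd.Index(np.arange(n_times), name='time'))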

xarray n-D array

Cube of nodeid, valname, time

Pros: Seems like a natural way to index.

Cons: Not sure how to add data to a cube like that, which is needed for incrementally processing data from a simulator as it arrives over time.
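A minimal sketch of such a cube, with one possible way to grow it along time via xr.concat; whether that is efficient enough for incremental simulator output is exactly the open question above:

import numpy as np
import xarray as xr

n_nodes = 100
cube = xr.DataArray(np.random.rand(n_nodes, 2, 1),
                    coords=[('nodeid', np.arange(n_nodes)),
                            ('valname', ['val1', 'val2']),
                            ('time', [0])],
                    dims=('nodeid', 'valname', 'time'))

# Appending a new timestep means concatenating a new slab along 'time'
new_step = xr.DataArray(np.random.rand(n_nodes, 2, 1),
                        coords=[('nodeid', np.arange(n_nodes)),
                                ('valname', ['val1', 'val2']),
                                ('time', [1])],
                        dims=('nodeid', 'valname', 'time'))
cube = xr.concat([cube, new_step], dim='time')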

jbednar avatar Nov 14 '17 19:11 jbednar

Hey @jbednar, it's worth noting that with the current (row per triangle) approach, there is a way to associate values with each vertex; for example:

val val0 val1 val2 x0 y0 x1 y1 x2 y2

By extension, if someone wanted to add edge values, they could do something similar:

val val0 val1 val2 e0 e1 e2 x0 y0 x1 y1 x2 y2

I'm not saying we should go with this approach - it's not particularly memory-efficient, and kind of an eyesore - but I think it's worth pointing out for posterity.
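For reference, a minimal sketch of that wider per-triangle layout, with one value column per vertex (names illustrative):

import numpy as np
import pandas as pd

n_tris = 5
cols = ['val', 'val0', 'val1', 'val2',
        'x0', 'y0', 'x1', 'y1', 'x2', 'y2']
df = pd.DataFrame(np.random.rand(n_tris, len(cols)), columns=cols)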

gbrener avatar Nov 14 '17 19:11 gbrener

Sure, but if you have a value per vertex in that format, then any given vertex (or edge, for that matter) will show up in that dataframe multiple times, possibly with different values (one per triangle in which it appears). That's not just wasteful of memory; it opens an ambiguity about what the value of that vertex actually is. I think that means the data structure will only ever make sense as a derived structure, generated from some other master dataset that uniquely associates a value with each vertex. Whereas it's a perfectly reasonable, sustainable format for the case when there is one value per triangle.

jbednar avatar Nov 14 '17 19:11 jbednar

Here are some snippets of how I would approach this with DataArray:

At one point in time

from collections import OrderedDict
import numpy as np
import xarray as xr

xyzabc = np.random.uniform(0, 1, (1000, 6))
tri = np.c_[tuple(np.random.randint(0, xyzabc.shape[0], 2000) for _ in range(6))]

layer = ['x', 'y', 'z', 'other_1', 'other_2', 'other_3']
arr_xyz = xr.DataArray(xyzabc,
                         coords=[('node_id', np.arange(1000)),
                                 ('layer', layer)],
                         dims=('node_id','layer'))

arr_tri = xr.DataArray(tri,
                         coords=[('tri_id', np.arange(2000)),
                                 ('node', ['n1', 'n2', 'n3', 'U', 'V', 'depth'])],
                         dims=('tri_id', 'node'))


def get_x1_x2_x3(arr_xyz, arr_tri):
    # For each triangle, look up corner node i and keep its first three
    # layers (x, y, z); concatenate the three corners column-wise
    return np.c_[tuple(arr_xyz[arr_tri.values[:, i]][:, :3].values for i in range(3))]
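For reference, with the [:, :3] slice (x, y, z per corner node), calling the helper gives one row per triangle with the coordinates of its three corners:

corners = get_x1_x2_x3(arr_xyz, arr_tri)  # ndarray of shape (2000, 9)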

In [2]: arr_xyz
Out[2]:
<xarray.DataArray (node_id: 1000, layer: 6)>
array([[ 0.932327,  0.509483,  0.071723,  0.766976,  0.760727,  0.965754],
       [ 0.41747 ,  0.616555,  0.938868,  0.546678,  0.601699,  0.647546],
       [ 0.512001,  0.909487,  0.956083,  0.462362,  0.780362,  0.037038],
       ...,
       [ 0.517562,  0.03362 ,  0.4215  ,  0.620497,  0.086293,  0.507193],
       [ 0.958167,  0.762692,  0.561518,  0.282368,  0.522143,  0.292995],
       [ 0.200847,  0.951575,  0.178492,  0.700508,  0.784168,  0.199431]])
Coordinates:
  * node_id  (node_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer    (layer) <U7 'x' 'y' 'z' 'other_1' 'other_2' 'other_3'
In [3]: arr_tri
Out[3]:
<xarray.DataArray (tri_id: 2000, node: 6)>
array([[687,  39, 552, 741, 236, 299],
       [635, 427, 427, 304, 554, 488],
       [702, 591, 987, 924, 136, 109],
       ...,
       [685, 850, 234, 956, 543, 368],
       [ 75, 931, 348, 953, 335,  88],
       [437,  60, 416, 516, 997, 965]])
Coordinates:
  * tri_id   (tri_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * node     (node) <U5 'n1' 'n2' 'n3' 'U' 'V' 'depth'

Time series

@jbednar @gbrener Can we assume the triangulation doesn't change in time?

time = np.arange(10)

arr_xyz_time, arr_tri_time = (xr.concat(tuple(a for _ in range(time.size)), dim='time') for a in (arr_xyz, arr_tri))

arr_tri_time.coords['time'] = time
arr_xyz_time.coords['time'] = time

In [7]: arr_xyz_time
Out[7]:
<xarray.DataArray (time: 10, node_id: 1000, layer: 6)>
array([[[ 0.932327,  0.509483, ...,  0.760727,  0.965754],
        [ 0.41747 ,  0.616555, ...,  0.601699,  0.647546],
        ...,
        [ 0.958167,  0.762692, ...,  0.522143,  0.292995],
        [ 0.200847,  0.951575, ...,  0.784168,  0.199431]],

       [[ 0.932327,  0.509483, ...,  0.760727,  0.965754],
        [ 0.41747 ,  0.616555, ...,  0.601699,  0.647546],
        ...,
        [ 0.958167,  0.762692, ...,  0.522143,  0.292995],
        [ 0.200847,  0.951575, ...,  0.784168,  0.199431]],

       ...,
       [[ 0.932327,  0.509483, ...,  0.760727,  0.965754],
        [ 0.41747 ,  0.616555, ...,  0.601699,  0.647546],
        ...,
        [ 0.958167,  0.762692, ...,  0.522143,  0.292995],
        [ 0.200847,  0.951575, ...,  0.784168,  0.199431]],

       [[ 0.932327,  0.509483, ...,  0.760727,  0.965754],
        [ 0.41747 ,  0.616555, ...,  0.601699,  0.647546],
        ...,
        [ 0.958167,  0.762692, ...,  0.522143,  0.292995],
        [ 0.200847,  0.951575, ...,  0.784168,  0.199431]]])
Coordinates:
  * node_id  (node_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer    (layer) <U7 'x' 'y' 'z' 'other_1' 'other_2' 'other_3'
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9
In [8]: arr_tri_time
Out[8]:
<xarray.DataArray (time: 10, tri_id: 2000, node: 6)>
array([[[687,  39, ..., 236, 299],
        [635, 427, ..., 554, 488],
        ...,
        [ 75, 931, ..., 335,  88],
        [437,  60, ..., 997, 965]],

       [[687,  39, ..., 236, 299],
        [635, 427, ..., 554, 488],
        ...,
        [ 75, 931, ..., 335,  88],
        [437,  60, ..., 997, 965]],

       ...,
       [[687,  39, ..., 236, 299],
        [635, 427, ..., 554, 488],
        ...,
        [ 75, 931, ..., 335,  88],
        [437,  60, ..., 997, 965]],

       [[687,  39, ..., 236, 299],
        [635, 427, ..., 554, 488],
        ...,
        [ 75, 931, ..., 335,  88],
        [437,  60, ..., 997, 965]]])
Coordinates:
  * tri_id   (tri_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * node     (node) <U5 'n1' 'n2' 'n3' 'U' 'V' 'depth'
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9
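Assuming a static triangulation, pulling out a single timestep for rendering is then just a coordinate selection (a minimal sketch; arr_tri_time is redundant if the mesh really is fixed):

snapshot = arr_xyz_time.sel(time=3)      # (node_id, layer) slice at time 3
tri_snapshot = arr_tri_time.sel(time=3)  # identical to arr_tri for a static mesh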

PeterDSteinberg avatar Nov 14 '17 20:11 PeterDSteinberg

Can we assume the triangulation doesn't change in time?

Yes.

jbednar avatar Nov 14 '17 20:11 jbednar

So it seems the consensus here has gone with the xarray-based approach. While I agree that it is the most sensible format of those listed here, I do foresee some issues with it. By making the nodes and layers into coordinates, you're forcing everything into one large array, which means all the data will be coerced to the same type. In the example above that already seems problematic, since the arr_tri example declares the vector U and V components as well as a depth, yet the actual array values are integers, because the array primarily represents the node ids. Personally, I'd therefore recommend splitting the data up further and keeping the node and triangle data structures separate from any additional variables.

Here's an example of what I mean:

from collections import OrderedDict
import numpy as np
import xarray as xr

xyzabc = np.random.uniform(0, 1, (1000, 3))
tri = np.c_[tuple(np.random.randint(0, xyzabc.shape[0], 2000) for _ in range(3))]

layer = ['x', 'y', 'z']
arr_xyz = xr.DataArray(xyzabc,
                         coords=[('node_id', np.arange(1000)),
                                 ('layer', layer)],
                         dims=('node_id','layer'))

node_metadata = np.random.rand(1000)
meta_arr = xr.DataArray(node_metadata,
                 coords=[('node_id', np.arange(1000))],
                 dims=['node_id'])

arr_tri = xr.DataArray(tri,
                         coords=[('tri_id', np.arange(2000)),
                                 ('node', ['n1', 'n2', 'n3'])],
                         dims=('tri_id', 'node'))

U = np.random.rand(2000)
arr_U = xr.DataArray(U, coords=[('tri_id', np.arange(2000))], dims=['tri_id'])

V = np.random.rand(2000)
arr_V = xr.DataArray(V, coords=[('tri_id', np.arange(2000))], dims=['tri_id'])

trimesh_ds = xr.Dataset({'tri': arr_tri, 'xyz': arr_xyz, 'node_info': meta_arr,
                         'U': arr_U, 'V': arr_V})
trimesh_ds
<xarray.Dataset>
Dimensions:    (layer: 3, node: 3, node_id: 1000, tri_id: 2000)
Coordinates:
  * tri_id     (tri_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * node       (node) <U2 'n1' 'n2' 'n3'
  * node_id    (node_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
  * layer      (layer) <U1 'x' 'y' 'z'
Data variables:
    tri        (tri_id, node) int64 139 866 831 555 524 900 791 987 47 621 ...
    xyz        (node_id, layer) float64 0.4694 0.4514 0.4658 0.5118 0.5406 ...
    node_info  (node_id) float64 0.03132 0.1233 0.5175 0.9879 0.3268 0.4175 ...
    U          (tri_id) float64 0.07942 0.6965 0.41 0.4002 0.6484 0.8865 ...
    V          (tri_id) float64 0.844 0.5612 0.4175 0.1812 0.2498 0.1711 ...

I believe this approach is fully flexible and is a more idiomatic use of xarray since each data variable now represents one type of data.
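As a usage sketch (not part of any proposed API), gathering the corner coordinates, or any per-node variable, for every triangle from this Dataset is a single fancy-indexing step:

# Node ids for each triangle corner, shape (2000, 3)
corner_ids = trimesh_ds['tri'].values

# x/y/z coordinates of each corner, shape (2000, 3, 3)
corner_xyz = trimesh_ds['xyz'].values[corner_ids]

# A per-node variable sampled at each corner, shape (2000, 3)
corner_info = trimesh_ds['node_info'].values[corner_ids]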

philippjfr avatar Nov 20 '17 15:11 philippjfr

Some notes for reference: most of the source data will be in the following forms.

1. A mesh definition, i.e.:

node_id, x, y ...

element_id, node_id0, node_id1, node_id2 ...

Often for 3D simulations there will be a vertical component as well. The 2D mesh is extruded and has various layers, which hold the data at fixed or proportional depths.

The vertical layer definition will typically have layer numbers (say, 1 through 10) and either:

fixed depths: surface, -2m, -4m, -6m, etc. In this case, depending on the depth at that node, only certain layers will have data; the others will have NaNs or some kind of missing-data value.

proportional or sigma depths, e.g. 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, which are fractions of the total depth at that node.

For completeness, the above is for a triangular mesh. If you want to be more general, there is the possibility of quad unstructured meshes and mixed quad/tri meshes as well. Some models use hexagons but that is very rare. Typically in that case the connectivity is given by:

element_id, element_type, node_id0, node_id1, node_id2, node_id3 ...

element_type is tri or quad, and node_id3 is ignored for triangles.
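A hedged sketch of how such a mixed tri/quad connectivity table might look as a DataFrame; using -1 as a fill value for the unused node on triangles is just an illustrative convention, not part of any existing format:

import pandas as pd

# element_type is 'tri' or 'quad'; node_id3 is -1 (unused) for triangles
conn = pd.DataFrame({
    'element_type': ['tri', 'quad', 'tri'],
    'node_id0': [0, 1, 4],
    'node_id1': [1, 2, 5],
    'node_id2': [2, 3, 6],
    'node_id3': [-1, 4, -1],
}, index=pd.RangeIndex(3, name='element_id'))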

2. Multiple data files

Data files are basically split up by parameter (salinity, temp, etc.) and type (int, float, etc.) for storage efficiency and data transfer.

scalar: time, value0 ... valueN {for each N in node_ids} ...

vector: time1... parameter1 value0 ... valueN {for each N in node_ids} parameter2 value0 ... valueN {for each N in node_ids} time2 ...

OR

time1 value_p1 value_p2 ... value_pN value_pN {for each N in node_ids}
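Mapping those file layouts onto the xarray structures above, each scalar parameter would naturally become a (time, node_id) array, and a vector parameter a pair of them; a minimal sketch with illustrative names:

import numpy as np
import xarray as xr

n_nodes, n_times = 1000, 24
times = np.arange(n_times)

# One scalar parameter (e.g. salinity): one value per node per timestep
salinity = xr.DataArray(np.random.rand(n_times, n_nodes),
                        coords=[('time', times), ('node_id', np.arange(n_nodes))],
                        dims=('time', 'node_id'))

# One vector parameter: two such arrays (here just the U component; V is analogous)
U = xr.DataArray(np.random.rand(n_times, n_nodes),
                 coords=[('time', times), ('node_id', np.arange(n_nodes))],
                 dims=('time', 'node_id'))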

Based on the above, I think the xarray approach is the most feasible, but I don't like calling the x/y/z coordinates "layers".

dharhas avatar Dec 06 '17 16:12 dharhas

Looking at being able to plot unstructured grids and trimeshes (and maybe others).

I suggest we use the gridded package:

https://github.com/NOAA-ORR-ERD/gridded

The idea behind gridded is to reflect the data model of multiple grid types, and present a common API for working with them.

For the most part, datashader can't use the common API, as it really needs to know the structures of the data, but it can use the data structures that are built into it:

gridded.pyugrid and gridded.pysgrid

That would buy you the file reading/writing/processing abilities of gridded.

And you need a data model anyway, so why not use one that already exists?

(and the data structures are pretty much what you're already using anyway)

ChrisBarker-NOAA avatar Jul 14 '18 21:07 ChrisBarker-NOAA