DimensionalData.jl icon indicating copy to clipboard operation
DimensionalData.jl copied to clipboard

Materialize `DimArray` or `DimStack` From a Table

Open JoshuaBillson opened this issue 1 year ago • 18 comments

Description:

resolves #335

This PR aims to let users construct either a DimStack or DimArray from a table with one or more coordinate columns.

Unlike the existing contructor, rows may be out of order or even missing altogether.

Performance:

The algorithm is O(n), requiring two forward passes for each dimension to determine the correct order of rows.

1000x1000: 0.005 Seconds

2000x2000: 0.025 Seconds

4000x4000: 0.108 Seconds

8000x8000: 0.376 Seconds

Example:

julia> r = DimArray(rand(UInt8, 1000, 1000), (X, Y));

julia> t = r |> DataFrame |> Random.shuffle!;

julia> restored = DimArray(t, dims(r));

julia> all(r .== restored)
true

Next Steps:

  1. Finalize method signatures.
  2. Decide what to do with missing rows. Currently, users may choose a missing value to fill in (missing by default).
  3. Add support for a :geometry column.
  4. Write test cases.
  5. Update docs.

JoshuaBillson avatar Jun 18 '24 19:06 JoshuaBillson

There's still a few questions that need resolving:

  1. What do we do with missing rows? This can be handled explicitly in Rasters.jl, but DimArrays have no concept of missing values. We could let users choose a missing value to write, or we could encode it explicitly with missing. We could also disallow the existence of missing rows, but I don't think this is an ideal solution.
  2. How should we handle a :geometry column? I know that several packages in the Julia ecosystem are using this convention, but I don't have much experience with them. Can we assume :geometry will contain tuples of coordinates, or could they also be some sort of geometry like Meshes.Point?

JoshuaBillson avatar Jun 20 '24 07:06 JoshuaBillson

I think it's best to test that the geometry column's elements have GI.PointTrait - then you can always extract e.g. GI.x(point) and the same for y, z, and m. You can also interrogate the dimension via GI.is3d, GI.ismeasured.

Going forward to get the geometry column it's probably best to have first(GI.geometrycolumns(table)) - the fallback implementation will give you (:geometry,), but to handle tables with other geometry columns which may be indicated by e.g. metadata, it's better to use this.

asinghvi17 avatar Jun 20 '24 14:06 asinghvi17

DimensionalData.jl does not depend on GeoInterface

rafaqz avatar Jun 20 '24 15:06 rafaqz

DimensionalData.jl does not depend on GeoInterface

Being able to interact with geometries in a generic fashion would help us interop with other packages. GeoInterface's only dependency is Extents, which we already depend on. After importing DimensionalData, GeoInterface loads in 0.008 seconds.

I also see that Rasters already depends on GeoInterface. We could simply ignore the :geometry column in DimensionalData, then implement the additional functionality in Rasters. However, I think the problem is general enough to be implemented here.

JoshuaBillson avatar Jun 20 '24 18:06 JoshuaBillson

It's not about deps or timing, it's about clean feature scope.

The promise here is that Rasters (and YAX) have all the geo deps and features, so non-geo people don't have to worry about them. Some of the biggest contributors here use DD for unrelated fields.

I would ignore point columns here entirely and instead write the code so it's easy for Rasters to handle them. Taking the underscores off a few functions and documenting them as an real interface will help that.

rafaqz avatar Jun 20 '24 18:06 rafaqz

I've updated the docstrings for both the DimArray and DimStack constructors. By default, materializers will now use the Contains selector for irregular and non-numeric coordinates, with the option to specify an alternative when desired. We're exporting the restore_array and coords_to_index methods to be used by downstream packages like Rasters.jl. I've also written several test cases to show that we can handle tables with out of order and missing rows. If everything looks good, I think we can go ahead and merge this PR.

JoshuaBillson avatar Jul 05 '24 23:07 JoshuaBillson

I've been wondering if we can also detect the dimensions?

So find the minimum and maximum values and the smallest step (after sorting)? Then check if the step is consistent and choose Regular/Irregular?

rafaqz avatar Aug 08 '24 07:08 rafaqz

I've been wondering if we can also detect the dimensions?

So find the minimum and maximum values and the smallest step (after sorting)? Then check if the step is consistent and choose Regular/Irregular?

I think it should be possible to do this, but we need to consider how to handle floating point error. I've had issues in the past where cutting a DimArray into partially overlapping tiles results in the same coordinate having slightly different values in each tile. Here's an example that I ran into recently when working with some Sentinel 2 imagery:

julia> xdims = X(LinRange{Float64}(610000.0, 661180.0, 2560));

julia> ydims = Y(LinRange{Float64}(6.84142e6, 6.79024e6, 2560));

julia> d = DimArray(rand(Float32, 2560, 2560), (xdims, ydims));

julia> tile_1 = d[1:16,1:16];

julia> tile_2 = d[8:23,8:23];

julia> collect(dims(tile_1, Y))[8:end-1]
8-element Vector{Float64}:
 6.84128e6
 6.84126e6
 6.84124e6
 6.84122e6
 6.841200000000001e6
 6.841180000000001e6
 6.841160000000001e6
 6.841140000000001e6

julia> collect(dims(tile_2, Y))[1:8]
8-element Vector{Float64}:
 6.84128e6
 6.84126e6
 6.841240000000001e6
 6.84122e6
 6.8412e6
 6.84118e6
 6.84116e6
 6.84114e6

If we tried to estimate the step size as you propose, we would end up with 9.313225746154785e-10, which is the smallest difference between two adjacent coordinates. Floating point error could also throw off our estimate of the minimum and maximum values. We probably need to introduce some sort of tolerance parameter to handle these cases.

Another issue is how to determine if coordinates are regular or irregular, considering that some coordinates could be missing. Floating point error would also make a naive solution impractical. For example, if our step size varies by 1e-10, it's probably due to floating point precision and not because our coordinates are irregular. We also need to consider categorical values like Symbols or Strings.

JoshuaBillson avatar Aug 08 '24 10:08 JoshuaBillson

Yeah we probably need a tolerance check.

We could allow specifying if some dimensions are not sorted, like dims=(X=ForwardOrdered(), Y=ReverseOrdered, C=Unordered()) so categories could just be taken as they come.

rafaqz avatar Aug 08 '24 11:08 rafaqz

I've implemented the guess_dims method to infer dimensions directly from the data. It should work for both sorted and shuffled data and can handle missing rows. If the data is sorted, you shouldn't need to specify the ordering, since it can be inferred directly. In the case of shuffled data, you will need to give the order as a Dim => Order pair. If the names of your coordinate columns can be safely guessed (currently :X, :Y, :Z, :Ti, and :Band) and the data is sorted, then we should be able to infer the dimensions without any input from the user.

Here's an example of it in action:

julia> xdims = X(LinRange{Float64}(610000.0, 661180.0, 2560));

julia> ydims = Y(LinRange{Float64}(6.84142e6, 6.79024e6, 2560));

julia> bdims = Dim{:Band}([:B02, :B03, :B04]);

julia> d = DimArray(rand(UInt16, 2560, 2560, 3), (xdims, ydims, bdims));

julia> t = DataFrame(d);

julia> t_rand = Random.shuffle(t);

julia> dims(d)
↓ X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t)
↓ X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t, (X, Y, :Band))
↓ X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t_rand, (X => DD.ForwardOrdered(), Y => DD.ReverseOrdered(), :Band => DD.ForwardOrdered()))
↓ X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t_rand[1:10000,:], (X => DD.ForwardOrdered(), Y => DD.ReverseOrdered(), :Band => DD.ForwardOrdered()))
↓ X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
→ Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

As can be seen in the last example, guess_dims works even when we select only a subset of the shuffled rows.

JoshuaBillson avatar Aug 12 '24 00:08 JoshuaBillson

This is awesome to have, thanks.

It may be a bit before I can fully review and test it but it looks good from a quick skim.

rafaqz avatar Aug 14 '24 09:08 rafaqz

Any news on this PR @JoshuaBillson ? would be good to have this available

rafaqz avatar Sep 15 '24 10:09 rafaqz

Any news on this PR @JoshuaBillson ? would be good to have this available

Sorry, I've been busy with some other projects. I'll try to resolve your suggested fixes this week. Once that's done, I think we just need to update the docs for DimStack and DimArray and write some more test cases.

JoshuaBillson avatar Sep 18 '24 06:09 JoshuaBillson

No worries at all, whenever you have time.

There are just some nice consequences of having this, like loading GeoJSON to a vector data cube, so I'm keen to have it.

rafaqz avatar Sep 18 '24 07:09 rafaqz

How close is this PR to the finish line? I'm doing some dataframes-heavy things right now and found myself in a situation a few times where this would been super helpful. Haven't looked into the nitty-gritty of the code, but this looks almost ready. I can do some final polishing if that's all it takes to make this available.

tiemvanderdeure avatar Mar 19 '25 13:03 tiemvanderdeure

Maybe try it out? I guess it needs a rebase too. I haven't had time or a use case yet to test it so not sure if it's finished

rafaqz avatar Mar 19 '25 16:03 rafaqz

Maybe try it out? I guess it needs a rebase too. I haven't had time or a use case yet to test it so not sure if it's finished

I believe it was 90% finished. We mostly need test cases and documentation. We'll also need to rebase, as you mentioned.

One thing I noted when reading the new documentation is the inclusion of AutoValues(), AutoOrder(), and AutoSpan() to automatically infer dimensional lookups from an array of values. Could we reuse this logic here to infer the values, order, and span instead of using custom methods? This would remove redundancy from the codebase and ensure consistency in how lookups are detected.

JoshuaBillson avatar Mar 20 '25 01:03 JoshuaBillson

Okay the rebase wasn't too hard.

There are a few method ambiguities - I'll sort those out.

Definitely a good idea to use the automatic detection of lookup types using format. It would also be really neat if it were possible to pass things like X(Sampled(order = ForwardOrdered())) to force the dimensions to look a certain way, where other fields are inferred from the data. Similar to dimensions how you pass partially formatted dimensions to rand

tiemvanderdeure avatar Mar 20 '25 07:03 tiemvanderdeure

Okay a question here is: how should this work for AbstractArray{<:NamedTuple}. Since this is both a table and an array?

Maybe we can materialize it as a table if DimArray is called without a dims argument, or if the dims are in also columns, and just make a DimArray{<:NamedTuple} otherwise?

tiemvanderdeure avatar Apr 29 '25 08:04 tiemvanderdeure

Can you maybbe demonstrate that with a MWE? I don't think I totally get the problem

rafaqz avatar Apr 29 '25 09:04 rafaqz

Now locally I've made it so that it checks if the names of the dimensions are in the columns or not. This ensures that all of this works:

x = X([:a, :b, :c])
y = Y([10.0, 20.0])
d = Dim{:test}(1.0:1.0:3.0)
dimz = x, y, d
da = DimArray(ones(3, 2, 3), dimz; name=:data)
table = Tables.rowtable(da)
DimArray(table, dimz) # returns a DimArray with x, y, test dimensions
DimArray(table, (X, Y, :test)) # same as the line above
DimArray(table, Z(1:18)) # returns a DimVector{<:NamedTuple} with a Z dimension

I'll push it later this afternoon and then we can discuss.

Another case is if you call DimStack on a NamedTuple{<:AbstractVector}. Now do the elements of the namedtuple represent layers of the stack, or columns of the table?

tiemvanderdeure avatar Apr 29 '25 10:04 tiemvanderdeure

Is the NamedTuple not both the dims and the layers like another table would be?

rafaqz avatar Apr 29 '25 19:04 rafaqz

Is the NamedTuple not both the dims and the layers like another table would be?

Yes but there are edge cases where this is breaking. E.g.

nt = (X = 1:10, data = 1:10)
DimStack(nt, X(1:10))
julia> DimStack(nt, X(1:10))
┌ 10-element DimStack ┐
├─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Int64} 1:10 ForwardOrdered Regular Points
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── layers ┤
  :X    eltype: Int64 dims: X size: 10
  :data eltype: Int64 dims: X size: 10
└─────────────────────────────────────────────

Whereas on https://github.com/JoshuaBillson/DimensionalData.jl/pull/2 this returns

julia> DimStack(nt, X(1:10))
┌ 10-element DimStack ┐
├─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Int64} 1:10 ForwardOrdered Regular Points
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── layers ┤
  :data eltype: Int64 dims: X size: 10
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

tiemvanderdeure avatar Apr 30 '25 08:04 tiemvanderdeure

Ok so maybe we need to check for the dimensions columns and remove them? And have a keyword to not do that

rafaqz avatar Apr 30 '25 18:04 rafaqz

We could add an astable keyword for both of these cases (NamedTuple of Vector and Vector of NamedTuple)? Then if that it set to false it would never interpret a table as a table, basically preserving the previous behaviour. And it would default to true, which is to treat tables as tables.

EDIT: another ambiguous case where we don't want to treat the input as a table is when calling DimArray on a DimStack. This is a different dispatch though, but just shows it's hard to make this consistent.

tiemvanderdeure avatar May 05 '25 07:05 tiemvanderdeure

I just merged the commits from https://github.com/JoshuaBillson/DimensionalData.jl/pull/2 since we're not hearing from Joshua.

The merge conflicts are a bit tricky, actually. Because of the changes in https://github.com/rafaqz/DimensionalData.jl/pull/990. Git seems to be putting some of the code from this PR in illogical places.

tiemvanderdeure avatar May 05 '25 08:05 tiemvanderdeure

Should I try to resolve it?

rafaqz avatar May 05 '25 08:05 rafaqz

If you want to give it a go? Otherwise I can take a closer look at what changed in your PR but I don't know when I will have time

tiemvanderdeure avatar May 05 '25 10:05 tiemvanderdeure

There are some method ambiguities - I'll fix those.

Before I do that - I'm wondering if we can use abstract type for many of these constructors like (::Type{T})(args...; kw...) where T <: AbstractDimArray. The ambiguities are basically because we do that in some places and not in others. And it will also save us from copying lots of code over to Rasters.

Or does that not make sense for YAX?

tiemvanderdeure avatar May 06 '25 11:05 tiemvanderdeure

Those constructors are kind of dangerous for ambiguity if you want to change anything, so I'm not sure

rafaqz avatar May 06 '25 14:05 rafaqz