DiskArrays.jl icon indicating copy to clipboard operation
DiskArrays.jl copied to clipboard

Implementing in other packages

Open meggart opened this issue 5 years ago • 19 comments

I think the general indexing functionality for this package is there now, so that one could use this package as a dependency for NetCDF, HDF5 or Zarr without losing functionality. Here we can colltect a list to the packages that might profit from DiskArrays and link to the respective PRs:

  • [x] NetCDF.jl https://github.com/JuliaGeo/NetCDF.jl/pull/111
  • [x] Zarr.jl https://github.com/meggart/Zarr.jl/pull/24
  • [ ] NCDatasets.jl
  • [x] HDF5.jl (Basically through HDF5Utils.jl)
  • [x] ArchGDAL.jl https://github.com/yeesian/ArchGDAL.jl/pull/105

@visr @rafaqz @Alexander-Barth any suggestions and feedback would be highly appreciated

meggart avatar Jan 14 '20 09:01 meggart

Nice work! Do you think it would also make sense to apply this for GDAL raster grids in https://github.com/yeesian/ArchGDAL.jl, or at least https://github.com/evetion/GeoArrays.jl? @yeesian @evetion.

visr avatar Jan 14 '20 09:01 visr

Thanks for the feedback. Yes, both of them should definitely be on the radar. I just looked at the ArchGDAL raster interface and implementation there should be straightforward.

meggart avatar Jan 14 '20 15:01 meggart

Great. I guess GeoArrays.jl can be removed from the list. Right know it only supports reading the entire grid. And if it would add support for subsets, since it uses ArchGDAL for IO, it could profit from ArchGDAL's use of DiskArrays.

visr avatar Jan 14 '20 15:01 visr

This looks good. Seems pretty comprehensive for indexing it will be great to have everything standardised.

My vote is also to implement it in ArchGDAL, then GeoData.jl can use it too. I recently added some basic array indexing to ArchGDAL, we could build off that.

rafaqz avatar Jan 14 '20 22:01 rafaqz

Ok, I created an initial PR for ArchGDAL. Maybe it would be a good idea to start registering ChunkedArrayBase and DiskArrays so that testing gets easier for dependent packages?

meggart avatar Jan 15 '20 13:01 meggart

Great! Yeah I think registering may be good. Since both ChunkedArrayBase and DiskArrays are pure julia, no-deps, small projects, why do you prefer to split it into two packages? May be a good choice, but just wondering.

visr avatar Jan 15 '20 13:01 visr

I don't know what the best strategy would be here. I thought there might be packages that want to use the concept of chunking (maybe some DistributedArray-like package) that don't have the slow-access characteristics to be tackled here, but from a maintenance point of view I think you are right and it is better to merge these packages. I will copy the code from ChunkedArrayBase here then.

Are there any suggestions where DiskArrays should live? Is JuliaGeo the right place for it or should it stay in my namespace for now.

meggart avatar Jan 15 '20 13:01 meggart

Ok yeah then I agree on merging them.

Since it is potentially quite a foundational package with many reverse dependencies, placing it in an organization is probably a good idea. Since it is created for and often used with geospatial rasters, JuliaGeo could be a good place for it to live. Another good option would be in https://github.com/JuliaArrays, living along the other special Array types with package names *Arrays.jl.

visr avatar Jan 15 '20 14:01 visr

I was also wondering if it should be in JuliaArrays... HDF5 is a lot more general than geospatial data, and there are probably other non geo applications.

rafaqz avatar Jan 15 '20 19:01 rafaqz

I think since testing the PRs depends so much on this package being registered, I decided to register the package as it is for now. We can still move to an org later.

meggart avatar Jan 28 '20 14:01 meggart

Support for HDF5.jl would be great!

oschulz avatar Feb 03 '20 15:02 oschulz

@oschulz I just opened https://github.com/JuliaIO/HDF5.jl/issues/615 to discuss this

meggart avatar Apr 02 '20 13:04 meggart

Oh, thanks - almost forgot about this! I should also check if I can use this in UpROOT.jl.

oschulz avatar Apr 02 '20 13:04 oschulz

I've implemented this in https://github.com/evetion/GeoArrays.jl, based on the ArchGDAL.jl implementation. Only thing that it breaks is my support for missing values, for which I normally read the arrays into memory and convert them to Array{Union{Missing, T}}.

But Missing support is hard, the GDAL ecosystem supports a nodata value as well as complete masks.

evetion avatar Jan 02 '21 16:01 evetion

Great to hear about the GeoArrays implementation. Is there anything specific that could be done about the missing values on the DiskArrays side. I am struggling with this now and then as well and have some attempts at solving this e.g. here https://github.com/meggart/DiskArrayTools.jl/blob/6728512ed9c33e1fa261bf38b2de3b97589ce310/src/DiskArrayTools.jl#L155-L192 where the conversion from T -> Union{T,Missing} happens inside the readblock! function for CFDiskArrays, creating a copy of the data as well.

meggart avatar Jan 04 '21 10:01 meggart

Last time I tried, DiskArrays in NCDatasets I had some issues with types of undefined size like String and probable also NetCDF4 variable-length arrays:

julia> sizeof(String)
ERROR: Type String does not have a definite size.
Stacktrace:
 [1] sizeof(::Type{T} where T) at ./essentials.jl:449
 [2] top-level scope at REPL[3]:1

See here: https://github.com/Alexander-Barth/NCDatasets.jl/issues/79#issuecomment-632602971

Can DiskArrays now be used with an arrays of Strings?

Alexander-Barth avatar Jan 05 '21 09:01 Alexander-Barth

@Alexander-Barth have you seen and tried this: https://github.com/meggart/DiskArrays.jl/pull/24 sorry for not notifying you earlier, let me know if this works for you.

meggart avatar Feb 23 '21 11:02 meggart

Thanks a lot @meggart for let me know. I hope that I can find time soon to test this, but right now it is a bit difficult for me.

I had just a look to your patch:

"""
A fallback element size for arrays to determine a where elements have unknown
size like strings. Defaults to 100MB
"""
const fallback_element_size = Ref(100)

Should this be 100 bytes in doc strings ?

Alexander-Barth avatar Feb 24 '21 14:02 Alexander-Barth

@visr thank you for looking back into this issue in your test here. I am wondering if there is still the plan to have variables with unlimited dimensions (I think this is in the branch resizable). For NCDatasets with the DiskArrays branch resizable I am hitting this issue:

https://github.com/Alexander-Barth/NCDatasets.jl/issues/79#issuecomment-869652186

Alexander-Barth avatar Jun 28 '21 12:06 Alexander-Barth