Improvements to the DMR++ parser

Open TomNicholas opened this issue 1 year ago • 4 comments

The DMR++ parser was merged in #133, but there are a few ways it could be improved.

  1. Docs. It's not actually listed anywhere publicly that DMR++ files are supported, not even in the docstring of open_virtual_dataset.
  2. HDF4 support (#216)
  3. Use ChunkManifest.from_arrays, which should increase performance and will reduce reliance on the kerchunk in-memory format (https://github.com/zarr-developers/VirtualiZarr/pull/113#discussion_r1723850532)
  4. Internal code improvements, e.g.:
     a. Use the pathlib module instead of os internally
     b. Refactor to be more functional, see https://github.com/zarr-developers/VirtualiZarr/pull/113#discussion_r1723848303
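
To illustrate item 4a, here is a minimal sketch of swapping an os.path call for its pathlib equivalent. The function names and the idea of deriving the data file path by stripping a trailing ".dmrpp" suffix are assumptions made for illustration; they are not taken from the parser's actual code.

```python
import os
from pathlib import Path


def data_path_os(dmrpp_path: str) -> str:
    """os-style (hypothetical helper): drop a trailing '.dmrpp' extension."""
    root, ext = os.path.splitext(dmrpp_path)
    return root if ext == ".dmrpp" else dmrpp_path


def data_path_pathlib(dmrpp_path: str) -> Path:
    """pathlib-style (hypothetical helper): same logic using Path methods."""
    path = Path(dmrpp_path)
    # with_suffix("") removes only the last suffix, so 'a.nc.dmrpp' -> 'a.nc'
    return path.with_suffix("") if path.suffix == ".dmrpp" else path
```

One caveat with this kind of change: pathlib normalises repeated slashes, so a remote location such as "s3://bucket/file.nc.dmrpp" would not survive a round trip through Path and would need to be handled as a plain string or URL instead.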

cc @ayushnag @betolink

TomNicholas avatar Aug 26 '24 16:08 TomNicholas

I would like to be involved in some of this work. I can definitely work to better understand the complexities of HDF4 and the steps needed to enable HDF4 support.

Mikejmnez avatar Aug 26 '24 16:08 Mikejmnez

@ayushnag is there a way to identify a DMR++ file automatically? e.g. a file magic?

TomNicholas avatar Aug 27 '24 01:08 TomNicholas

Not to my knowledge. All valid XML files must start with the string "<?xml"; beyond that, I think we would need to read the header tags (e.g. xmlns:dmrpp="http://xml.opendap.org/dap/dmrpp/1.0.0#") to know it is a dmrpp file.
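
A rough sketch of what that header check could look like, using only the standard library. The function name is hypothetical, and matching on the namespace URI prefix without the version suffix is an assumption, not something the DMR++ format is confirmed here to guarantee.

```python
import xml.etree.ElementTree as ET

# Assumption: any DMR++ document declares a namespace under this URI prefix
# (the example above uses ".../dmrpp/1.0.0#").
DMRPP_NS_PREFIX = "http://xml.opendap.org/dap/dmrpp/"


def looks_like_dmrpp(source) -> bool:
    """Return True if the root element declares a dmrpp namespace.

    `source` is a filename or a binary file-like object. Only the start of
    the document is inspected: parsing stops at the root element.
    """
    try:
        for event, payload in ET.iterparse(source, events=("start-ns", "start")):
            if event == "start-ns":
                _prefix, uri = payload
                if uri.startswith(DMRPP_NS_PREFIX):
                    return True
            else:
                # Reached the root element's start tag without seeing the
                # dmrpp namespace among its declarations.
                return False
    except ET.ParseError:
        # Not well-formed XML at all.
        return False
    return False
```

This only checks namespaces declared on the root element, so a file that declares the dmrpp namespace deeper in the tree would be missed; whether that can happen in practice is not something this sketch settles.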

cc @Mikejmnez @jgallagher59701

ayushnag avatar Aug 27 '24 17:08 ayushnag

@ayushnag is right. The first four elements are not enough to distinguish a generic XML file from a dmrpp-generated one.

Mikejmnez avatar Aug 27 '24 20:08 Mikejmnez