schema_salad

RFE: fetch JSON directly from HDF5 files?

Open ankostis opened this issue 6 years ago • 8 comments

Would it be possible with slight changes in the resolving logic to support fetchers that produce YAML directly?

What I specifically have in mind is reading HDF5 files, either with pandas or with the client library of the newly published HDFServer+HDFJson standards. For both, I would need to:

  • craft "deep-linking" URLs into the HDF5 files,
  • modify the fetch-procedure to use one of the 2 access methods.

What API changes would be needed to accommodate this?

ankostis avatar May 26 '19 16:05 ankostis

Hello @ankostis and thank you for your proposal.

If there is already a REST/HTTP API then there is no change needed to schema-salad as it already supports HTTP(S) URLs.

mr-c avatar May 27 '19 22:05 mr-c

Hmm... yes, you're right. I forgot to mention that I wanted this to work locally, from a file:// URL.

ankostis avatar May 28 '19 07:05 ankostis

Would this be for just the initial input data to a CWL workflow, or did you envision this access pattern being used between steps or at the end?

The first case is possible today, with a local REST endpoint serving the input data.

mr-c avatar May 28 '19 08:05 mr-c

I have just a single workflow step with complex input data, so I figured that I can use schema-salad alone, like jsonschema on steroids. Does that make sense?

In any case, the salad will be parsed internally in my process, and I'm looking for the best way to patch this library so that it supports extracting data from different binary file types, for which fetch_text() does not make sense. I would also like your opinion on whether this is a totally doomed direction.

ankostis avatar May 28 '19 11:05 ankostis

@ankostis The quick solution is to inject a custom Fetcher using fetcher_constructor of Loader. You would then implement your custom fetch_text() which might involve reading the binary file, serializing to JSON, and returning the string to schema salad to be re-parsed, processed and validated.
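The quick solution could look roughly like the sketch below. It is not schema-salad code: `BinaryJsonFetcher` and `read_float64_file` are hypothetical names, the class is duck-typed rather than derived from schema-salad's Fetcher, and a toy file of float64 values stands in for HDF5 so that the example runs with the standard library alone. It only shows the shape of the idea: read the binary file, serialize it to JSON, and return the string.

```python
import json
import struct
from pathlib import Path
from typing import Any, Callable
from urllib.parse import urljoin, urlsplit
from urllib.request import url2pathname


def read_float64_file(path: Path) -> dict:
    """Hypothetical stand-in for an HDF5 reader: decode little-endian float64 values."""
    raw = path.read_bytes()
    count = len(raw) // 8
    return {"values": list(struct.unpack(f"<{count}d", raw[: count * 8]))}


class BinaryJsonFetcher:
    """Sketch of a fetcher that turns a local binary file into JSON text."""

    def __init__(self, reader: Callable[[Path], Any] = read_float64_file) -> None:
        # The reader is injected so an HDF5-based one could be swapped in later.
        self.reader = reader

    def _local_path(self, url: str) -> Path:
        parts = urlsplit(url)
        if parts.scheme != "file":
            raise ValueError(f"only file:// URLs are handled in this sketch: {url}")
        return Path(url2pathname(parts.path))

    def fetch_text(self, url: str) -> str:
        # Read the binary file and serialize it to JSON; the caller parses
        # this string a second time, which is the inefficiency discussed below.
        return json.dumps(self.reader(self._local_path(url)))

    def check_exists(self, url: str) -> bool:
        return self._local_path(url).exists()

    def urljoin(self, base_url: str, url: str) -> str:
        return urljoin(base_url, url)
```

To use this with schema-salad, the class would have to be adapted to derive from its Fetcher base class and be handed to Loader through fetcher_constructor; the constructor arguments that Loader passes should be checked against the installed version.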

A more complete solution might be to optionally move the parsing over to the other side of the Fetcher interface, adding something like a fetch_structured() method. However, in order to support line numbers in error reporting, schema salad assumes ruamel.yaml types (CommentedMap and CommentedSeq), so you can't just return plain Python dicts and lists.

tetron avatar May 28 '19 14:05 tetron

Great answer. Thanks. For the first case, wouldn't that slightly waste CPU cycles, since the JSON would be parsed twice, once on each side of the fetcher interface?

ankostis avatar May 28 '19 16:05 ankostis

Yes, it is somewhat inefficient, but it is worth trying as a proof of concept; once it works, you can look at optimization.

tetron avatar May 28 '19 16:05 tetron

Interesting discussion. I ran into this while considering how to integrate salad-based metadata into Zarr datasets. Very briefly, Zarr provides a data structure very similar to HDF5's, but spreads it across multiple files. Relevant for this discussion: the metadata is stored as separate JSON files, which would be loadable via file:/// without the need for a service.
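As a rough illustration of that last point, here is a minimal sketch that reads Zarr metadata through a file:/// URL using only the standard library. It assumes the Zarr v2 layout, where user attributes live in a `.zattrs` JSON file inside the group or array directory; `load_zarr_attrs` is a hypothetical helper, not part of Zarr or schema-salad.

```python
import json
from urllib.request import urlopen


def load_zarr_attrs(group_url: str) -> dict:
    """Load the user attributes of a Zarr v2 group or array from its .zattrs file."""
    # Assumption: Zarr v2 layout, with attributes stored in <group>/.zattrs as JSON.
    base = group_url if group_url.endswith("/") else group_url + "/"
    with urlopen(base + ".zattrs") as response:
        return json.loads(response.read().decode("utf-8"))
```

Because `urlopen` handles file: URLs as well as HTTP(S) ones, the same call works for a local dataset and for one served remotely.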

joshmoore avatar Apr 21 '21 14:04 joshmoore