echopype icon indicating copy to clipboard operation
echopype copied to clipboard

Account for `conversion` -> `combination` change in `Provenance` attribute

Open b-reyes opened this issue 2 years ago • 1 comments

In PR #806 conversion was changed to combination in combine.py. This causes issues when displaying a combined file because we specifically obtain the version from the Provenance attributes:

https://github.com/OSOceanAcoustics/echopype/blob/47b827a927f73b7e00f3a27e48b6509037a1732a/echopype/echodata/echodata.py#L183-L186

I don't believe this will causes issues with other portions of the codebase.

b-reyes avatar Sep 16 '22 17:09 b-reyes

Based on discussion in #811, I'm changing the Milestone tag to 0.6.4

emiliom avatar Sep 16 '22 20:09 emiliom

This seems also an issue that we are in a position to address once and for all.

leewujung avatar Nov 14 '22 05:11 leewujung

@emiliom and @leewujung discussed this: we will not remove these provenance attributes and just use a long History attribute to capture all the functions applied to the data. Some more thinking required to figure out what to do with the outcome of combine_echodata. We might add a set of equivalent attributes with name combinatino_*.

leewujung avatar May 25 '23 00:05 leewujung

The conversion_* provenance attributes are actually specified by the convention and are "mandatory if applicable or available" (MA). If I tried to change them earlier (#806), that was a mistake! So, we will definitely not remove these attributes.

Adding equivalent combination_* attributes when combining data (echodata objects) may be a good way to go. Then, the version_info method mentioned above can be modified so that it first checks if a conversion_software_version attribute exists; if it does, it'll use it to read the echodata (=echopype) version info, otherwise it falls back to the conversion_software_version.

But before settling on that scheme, I'll ask the SONAR-netCDF4 community about the interpretation of these "conversion" provenance attributes. Specifically, should a combine-data processing step be interpreted as another type of conversion step? In that case, we should just reuse the conversion_* attributes. The SONAR-netCDF4 documents provide too little information to make that determination; they just say that these attributes refer to the "software used to do the conversion". Is combination a type of conversion?

I'll hold off on working on this until PR #1042 is merged. But I'll plan to ask the SONAR-netCDF4 community soon.

emiliom avatar May 25 '23 18:05 emiliom

I think reusing the “conversion” set of attributes to store the “combination” information is potentially problematic, because we will lose the info pertaining to each originally converted files. It will also make it much harder to keep track of what’s converted from the individual raw file and what’s combined afterwards (and therefore does not 1-1 corresponding raw files). I suggest that we just keep “conversion” and “combination” separate.

Also, for distinguishing if a file is previously combined or just converted: in discussing with @lsetiawan on #1042, I thought we can add a global attribute to the Too-level group combine = True or False. This way we can easily work with the files without having to load the metadata from specific groups, and just use the very light weight Top-level group for this information. I have not actually reviewed the PR to see it’s implemented, but even if not, it’s a relatively small change to make.

leewujung avatar May 25 '23 19:05 leewujung

I thought we can add a global attribute to the Too-level group combine = True or False

This is stored in Provenance rather than the Top-level after discussion with @emiliom.

Update from @emiliom, 7/25/23: The name of that new, "boolean" attribute is is_combined.

lsetiawan avatar May 25 '23 19:05 lsetiawan

The conversion_* attributes are defined in the convention as having single entries. eg: conversion_time : 2023-07-27T18:47:02Z. combine_data reads multiple converted datasets, each with its own values for these attributes. Currently it looks like combine_echodata doesn't try to create a list of the attribute values from each input dataset, but it overwrites conversion_time.

I believe that in other cases of global attributes, lists are created from the collection of inputs and variables in the Provenance group are created to hold those lists? Should we do that here? Then the question would be, what values should remain in the single set of Provenance.conversion_* attributes; the most recent one?

emiliom avatar Jul 27 '23 19:07 emiliom

Currently it looks like combine_echodata doesn't try to create a list of the attribute values from each input dataset, but it overwrites conversion_time.

I believe that in other cases of global attributes, lists are created from the collection of inputs and variables in the Provenance group are created to hold those lists? Should we do that here? Then the question would be, what values should remain in the single set of Provenance.conversion_* attributes; the most recent one?

I think combine_echodata currently:

  • overwrites the global attributes of each group using the first value for a given attributes: I think this makes more sense because if we treat a set of files to be 1 entity, "the first" is the value of the first files. So currently the conversion_time after combine_echodata will be conversion_time of the first file. This is in _merge_attribute: https://github.com/OSOceanAcoustics/echopype/blob/d27384ea36c2f6cdf567c40d84bc80f39ff05f27/echopype/echodata/combine.py#L559. There is a NOTE in there about you and I discussing this.
  • put together a
  • does create data variables in Provenance to store attributes from the original (pre-combined) files into data variables in the Provenance group, with the source echodata group stored as an attribute. This is in _capture_prov_attrs: https://github.com/OSOceanAcoustics/echopype/blob/d27384ea36c2f6cdf567c40d84bc80f39ff05f27/echopype/echodata/combine.py#L604

I think previously we talked about potentially having a combination_time that is separate from conversion_time. What do you think about that?

leewujung avatar Jul 28 '23 22:07 leewujung

I think previously we talked about potentially having a combination_time that is separate from conversion_time. What do you think about that?

Yes, that's still the plan.

emiliom avatar Jul 29 '23 20:07 emiliom

Addressed in #1113. Closing the issue

emiliom avatar Aug 04 '23 17:08 emiliom