
Is a method to determine sysmeta formatId, mediaType needed?

Open gothub opened this issue 8 years ago • 2 comments

When a script creates a SystemMetadata object (i.e. when a DataObject is created), the sysmeta formatId must be specified.

Is it advisable to have a method that automatically determines the formatId from the file extension or the file contents? This is an old problem with known issues, such as reliability.

Is it advisable to have an automated way to determine formatId, or to rely on the user determining and specifying this?

gothub avatar Apr 11 '17 16:04 gothub

@amoeba has a function that tries to guess the formatId. It's nice when it works. It doesn't always work, so we've been discussing whether no default is better than an incorrect guess. Let's discuss further. Maybe Jeanette and Jesse have thoughts on this too.

mbjones avatar Apr 11 '17 20:04 mbjones

Yeah, guess_format_id uses a hard-coded map between D1 format IDs and file extensions: https://github.com/NCEAS/arcticdatautils/blob/master/R/util.R#L79. I threw in a custom routine for NetCDF files that uses the metadata to guess the specific NetCDF version, but otherwise the guess is based on file extension alone.

There are limitations and even major issues:

  • A file may be missing an extension, so a guessing routine may settle on a less specific format ID than the user intends
  • A file may use a different extension than expected (e.g. .txt for CSV/TSV), with the same result: a less specific format ID than the user intended
  • Some file extensions need special handling routines (e.g. XML, NetCDF), which basically results in an arms race between this guessing routine and the D1 formats list: as we add formats to the CN formats list, this routine needs to be updated
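
To make the first two limitations concrete, here is a minimal sketch of extension-only guessing. It is not the actual guess_format_id code: the function name, the three-entry map, and the "application/octet-stream" fallback are all assumptions made for illustration.

```r
# Hypothetical extension-to-formatId map (an assumption for this sketch).
# A real map would need to track the D1 formats list.
extension_map <- c(
  csv = "text/csv",
  txt = "text/plain",
  xml = "text/xml"
)

# Guess a format ID from the file extension alone.
# Falls back to a generic ID (also an assumption) when the extension
# is missing or not in the map.
guess_format_id_sketch <- function(path) {
  ext <- tolower(tools::file_ext(path))
  known <- ext %in% names(extension_map)
  ifelse(known, unname(extension_map[ext]), "application/octet-stream")
}

# A CSV saved as .txt and a file with no extension both get a
# less specific ID than the user probably wants.
guess_format_id_sketch(c("data.csv", "table.txt", "README"))
```

Nothing here looks at file contents, so the guess is only as good as the file name.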

From a user perspective, I have been told the guessing is nice, but I don't personally feel like it's really necessary. If the format ID isn't guessed, I think we would need to give users a useful mechanism in R to find the available values, e.g.,

> magicUploadFunction(my_path)
Error: You must specify the format_id argument when using magicUploadFunction. Run `formatsList()` to see a list of possible values.
> formatsList()
format_id                               Name        Type
eml://ecoinformatics.org/eml-2.0.0      EML 2.0.0   METADATA  
eml://ecoinformatics.org/eml-2.1.0      EML 2.1.0   METADATA  
eml://ecoinformatics.org/eml-2.1.1      EML 2.1.1   METADATA  
text/csv                                CSV         DATA
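
A rough sketch of how that could behave, assuming nothing about the real package: magicUploadFunction and formatsList are the made-up names from the transcript above, the formats table is hard-coded here rather than fetched from the CN, and nothing is actually uploaded.

```r
# Hypothetical, hard-coded stand-in for the CN formats list.
formatsList <- function() {
  data.frame(
    format_id = c("eml://ecoinformatics.org/eml-2.1.1", "text/csv"),
    Name = c("EML 2.1.1", "CSV"),
    Type = c("METADATA", "DATA"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical upload function: refuses to guess, and points the
# user at formatsList() instead.
magicUploadFunction <- function(path, format_id) {
  if (missing(format_id)) {
    stop("You must specify the format_id argument when using magicUploadFunction. ",
         "Run `formatsList()` to see a list of possible values.",
         call. = FALSE)
  }
  if (!format_id %in% formatsList()$format_id) {
    stop("Unknown format_id '", format_id, "'. ",
         "Run `formatsList()` to see a list of possible values.",
         call. = FALSE)
  }
  # No upload happens in this sketch; return what would be recorded.
  list(path = path, format_id = format_id)
}
```

Checking the supplied value against the list also catches typos in format IDs, which a silent guess would not.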

amoeba avatar Apr 11 '17 20:04 amoeba