eml dictionary needed for externallyDefinedFormat

Author Name: Peter McCartney (Peter McCartney) Original Redmine Issue: 1197, https://projects.ecoinformatics.org/ecoinfo/issues/1197 Original Date: 2003-10-31 Original Assignee: Matt Jones

Externally defined format is useless for automatic processing unless you have some idea what to look for. This is a step backwards from FGDC which at least provided enumerations for the common file formats at the time.

Mar 12 '17 02:03 mbjones

Original Redmine Comment Author Name: Peter McCartney (Peter McCartney) Original Date: 2003-12-17T18:36:51Z

Here is a possible format for a dictionary file to provide an authority and reference for data formats (and archive formats).

Mar 12 '17 02:03 mbjones

Original Redmine Comment Author Name: Peter McCartney (Peter McCartney) Original Date: 2003-12-17T20:19:09Z

Mar 12 '17 02:03 mbjones

Original Redmine Comment Author Name: Peter McCartney (Peter McCartney) Original Date: 2003-12-17T20:45:21Z

ok the issue seems to be

we need a controlled enumeration for externallyDefinedFormat that is both recognized by users and parsable by applications
mime types were created to serve this purpose. Project alexandria investigated this and decided that a combination of both format name and mime types was needed, since the appropriate mime type is not always adequate. Read http://www.alexandria.ucsb.edu/middleware/dtds/ADL-access-report.dtd to see their discussion. Basically they provide three elements in their metadata schema for downloads - format, mime, and encoding.
vendors are slowly adding mime types but very few scientific data formats have been added. if we define mimes for these formats we could register them only by putting an x- in front of them. and of course these definitions would be deprecated when the owner puts in a definition.
dataFormat is required, so if the data are in Oracle, we need to have SOMETHING to put here, even if the information is superfluous once connectionDefinition is filled out. it's not clear to me if mimes even apply to connections - perhaps these are all octet-streams?
going beyond the enumeration issue, if we were to adopt a dictionary, we have the option of storing other metadata on a format that could be useful. the example i show here lists each part of a multipart format and its mime type. we use a file similar to this in our Xylopia data service to determine what parts of a file format need to be gathered up into the zip package. in my example it lists extensions, which works fine for dealing with shapefiles, dbf, mapinfo, geoTiff, and so on. the only other multipart type, and one that does not use extensions to identify its parts, is arcinfo coverages. in this case the rules rely on folder names and filenames under those folders to handle the different parts. because coverages within one folder share a common metadata folder, you cannot move coverages by zipping up the files; you must open the coverage and save it as some other format for transport.
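
A rough sketch of the extension-based lookup described in the last point, in Python. This is not the actual Xylopia file: the dictionary contents and the function name `collect_parts` are assumptions made for illustration, and the shapefile entry lists only a few commonly seen part extensions.

```python
from pathlib import PurePath

# Hypothetical dictionary: format name -> extensions of its parts.
# The entries are illustrative, not an authoritative list.
MULTIPART_FORMATS = {
    "shapefile": [".shp", ".shx", ".dbf", ".prj"],
    "dbf": [".dbf"],
}


def collect_parts(format_name, base_name, available_files):
    """Return the files sharing base_name whose extension belongs to the format."""
    wanted = set(MULTIPART_FORMATS[format_name])
    parts = []
    for name in available_files:
        path = PurePath(name)
        # Compare extensions case-insensitively so "roads.DBF" still matches.
        if path.stem == base_name and path.suffix.lower() in wanted:
            parts.append(name)
    return sorted(parts)
```

A packaging step could then zip whatever `collect_parts("shapefile", "roads", files)` returns. Folder-based formats such as arcinfo coverages would need different rules and are not covered by this sketch.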

there was some debate about the utility of this multipart info, so i'm willing to table that part of the issue and continue to do it internally ourselves. but it would be really nice if we could agree how to ensure that shapefile will always be shapefile and not Shapefile, shape file, shape, esrishapefile... etc.
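
One way to picture the shapefile naming problem is a small lookup that folds free-text variants onto a single controlled term. The sketch below is only an illustration: the table `CANONICAL_FORMATS` and the function `normalize_format_name` are hypothetical names, and the variant list is taken from the spellings mentioned above.

```python
import re

# Hypothetical controlled term -> known free-text variants (after cleanup).
CANONICAL_FORMATS = {
    "shapefile": {"shapefile", "shape", "esrishapefile"},
}


def normalize_format_name(raw):
    """Map a free-text format name to its controlled term, or None if unknown."""
    # Lowercase and drop everything except letters and digits,
    # so "Shape file" and "shapefile" compare equal.
    key = re.sub(r"[^a-z0-9]", "", raw.lower())
    for canonical, variants in CANONICAL_FORMATS.items():
        if key in variants:
            return canonical
    return None
```

Returning `None` for unrecognized names, rather than guessing, leaves it to the caller to decide whether an unknown format should be rejected or proposed as a new dictionary entry.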

the attachment i put in (and edited) was an example of such a dictionary showing how multipart, single part and service formats could all be handled using a strategy similar to ADA where we define format types, and then list the mimes for each of the parts. Matt felt this was inappropriate as there is in fact a multipart mime type. so a variant on this would be to put the mime attribute in the externallyDefinedElement tag rather than in the part tag (or both). the nice thing about this is that, like stmml.xsd, it abstracts users from complicated terminology yet does enable machine processing through mime types when they exist. if we leave the mime element out of eml, then the dictionary can have the most up-to-date mime for any given format and we don't have to edit eml files when new mime types appear.

Mar 12 '17 02:03 mbjones

Original Redmine Comment Author Name: Matt Jones (Matt Jones) Original Date: 2004-09-02T16:38:09Z

Changing QA contact to the list for all current EML bugs so that people can track what is happening.

Mar 12 '17 02:03 mbjones

Original Redmine Comment Author Name: Redmine Admin (Redmine Admin) Original Date: 2013-03-27T21:16:29Z

Original Bugzilla ID was 1197

Mar 12 '17 02:03 mbjones

A working group of LTER information managers developing best practices for non-tabular data is bumping into the issues described above regarding the lack of a controlled list for formatName. We haven't found an authoritative source that encompasses the bulk of our non-tabular types. It seems like several folks have tried, and we didn't want to just make up yet another list. But having a list like the units list in EML would be helpful. +1 for this enhancement, and I hope this issue eventually gets some traction.

Aug 14 '20 18:08 twhiteaker

Can someone clarify the intent of formatName? Should it be a mime type or more human friendly? E.g., "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" (mime type from Mozilla) vs. "Microsoft Excel" (format name from EML example).

Aug 18 '20 16:08 twhiteaker

Good question Tim. For MCR LTER datasets, I use the formatName element in at least 22 different datasets (only counting most recent revisions). In all of those I used a human-readable label, not a mime type. I cannot verify what I did was best practice, but I likely asked someone-who-knows at the time. Here are my examples:

    knb-lter-mcr.1036.5.xml: <formatName>png image</formatName>
    knb-lter-mcr.1041.1.xml: <formatName>Text File</formatName>
    knb-lter-mcr.5003.xml: <formatName>zip of JPG files</formatName>
    knb-lter-mcr.5004.xml: <formatName>zip of JPG files</formatName>
    knb-lter-mcr.5006.xml: <formatName>zip of 671 JPEG image and 671 text files</formatName>
    knb-lter-mcr.5006.xml: <formatName>zip of 689 JPEG image and 689 text files</formatName>
    knb-lter-mcr.5006.xml: <formatName>zip of 695 JPEG image and 695 text files</formatName>
    knb-lter-mcr.5013.xml: <formatName>zip archive</formatName>
    knb-lter-mcr.5018.xml: <formatName>Microsoft Office Excel xlsx</formatName>
    knb-lter-mcr.5020.0.xml: <formatName>Microsoft Office Excel xlsx</formatName>
    knb-lter-mcr.5020.xml: <formatName>Microsoft Office Excel xlsx</formatName>
    knb-lter-mcr.5021.10.xml: <formatName>Microsoft Office Excel xlsx</formatName>
    knb-lter-mcr.5024.xml: <formatName>Microsoft Office Excel xlsx</formatName>
    knb-lter-mcr.5025.0.xml: <formatName>Microsoft Office Excel xlsx</formatName>
    knb-lter-mcr.5031.0.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5031.10.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5032.1.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5036.1.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5037.10.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5038.10.xml: <formatName>Text File</formatName>
    knb-lter-mcr.5038.1.xml: <formatName>Text File</formatName>
    knb-lter-mcr.5039.10.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5039.1.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5040.10.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.5040.11.xml: <formatName>MS-Excel</formatName>
    knb-lter-mcr.6002.xml: <formatName>application/zip</formatName>
    knb-lter-mcr.6002.xml: <formatName>png</formatName>
    knb-lter-mcr.6003.xml: <formatName>application/zip</formatName>
    knb-lter-mcr.6003.xml: <formatName>jpg</formatName>
    knb-lter-mcr.6004.xml: <formatName>application/zip</formatName>
    knb-lter-mcr.6004.xml: <formatName>jpg</formatName>

On Tue, Aug 18, 2020 at 9:09 AM Tim Whiteaker [email protected] wrote:

Can someone clarify the intent of formatName? Should it be a mime type or more human friendly? E.g., "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" (mime type from Mozilla https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types) vs. "Microsoft Excel" (mime type from EML example https://eml.ecoinformatics.org/schema/eml-physical_xsd.html#PhysicalType_PhysicalType_dataFormat_PhysicalType_PhysicalType_dataFormat_externallyDefinedFormat_formatName ).


Aug 18 '20 18:08 gastil

@gastil thanks for the examples. application/zip in the later examples looks like a MIME type, compared to "zip archive". Also there's Microsoft Office Excel xlsx versus MS-Excel in those examples. It highlights the need for guidance or a reference list to choose from. :)

Aug 19 '20 17:08 twhiteaker

@twhiteaker we ended up tackling this at DataONE with an extensible formats service, and I think it would be great if we adopted the controlled vocabulary that we use there for entity type names in EML. See: https://cn.dataone.org/cn/v2/formats In DataONE we've had extensive discussions of this issue and the relationships to other type services like ProNOM and UDFR and GDFR, among others. I can dig up those threads if it's useful, but the outcome was that we needed a simpler and extensible list, and now all DataONE member repositories type their objects using that list (and we extend it as needed for new types).

We ended up externalizing this type to SystemMetadata in DataONE so that it would work across standards (e.g., see one of your CSV files here: https://cn.dataone.org/cn/v2/meta/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-ble%2F14%2F3%2F3983d2a3c44e301acb6d4d37af38e525), but I also think that it would be best if there was a correspondence with EML entity formatName. We felt that MIME type was often not specific enough, although sometimes it was. The controlled vocabulary of entity types that we maintain for DataONE is extensible. And it includes the associated MIME type as a subfield (look at the XML output of the service, not just the HTML formatted display).

Aug 19 '20 18:08 mbjones

@mbjones that looks promising. What would be the process for adding an item to that vocab? For example, we'd want an entry for shapefiles, and a shapefile would likely be distributed as a zip archive of the various files that make up a shapefile. How do you go about determining what should go in Id, name, type, and media type? I picked a complicated one on purpose, since you already have an entry for zip file, and a zipped shapefile is really a shapefile that happens to be packaged as a zip file.

If an entry was added, would you migrate the version from 2 to 2.1? Or would you collect several entries before incrementing the version number?

Although LTER already has its own controlled vocabularies, I prefer to use an existing vocab if it can work for us rather than invent our own. This could be an interesting use case for how LTER IMs and DataONE could work more closely together. So I'm curious to learn how this interaction would work.

Aug 20 '20 17:08 twhiteaker

@srearl We track requests for changes in our issue trackers (mostly GitHub now, but some older repositories are still in Redmine and SVN, which applies here). New format requests typically occur when we are onboarding new repositories that have content types we haven't seen before, and the new formats get added then. In the case of shapefiles, this was an oversight which we noted years ago; the format still hasn't been added but should be ASAP (https://redmine.dataone.org/issues/6883). I have asked @taojing2002 to move forward with that and get the entries added.

The v2 in the service URI for the formats service is the version of the API, not the version of the content. So, the v2 will stay the same. Format identifiers are never removed, only new ones are added, although we do at times correct spelling errors in the metadata describing the format, or add new mime types or extensions. Here's a little snippet of the format list so you can see how sometimes the formatId is more granular than the associated mime types (e.g., many formats can share the same MIME type like text/xml).

    <objectFormat>
        <formatId>image/tiff</formatId>
        <formatName>Tagged Image File Format</formatName>
        <formatType>DATA</formatType>
        <mediaType name="image/tiff"/>
        <extension>tiff</extension>
    </objectFormat>
    <objectFormat>
        <formatId>http://rs.tdwg.org/dwc/xsd/simpledarwincore/</formatId>
        <formatName>Simple Darwin Core</formatName>
        <formatType>METADATA</formatType>
        <mediaType name="text/xml"/>
        <extension>xml</extension>
    </objectFormat>
    <objectFormat>
        <formatId>http://digir.net/schema/conceptual/darwin/2003/1.0/darwin2.xsd</formatId>
        <formatName>Darwin Core, version 2.0</formatName>
        <formatType>METADATA</formatType>
        <mediaType name="text/xml"/>
        <extension>xml</extension>
    </objectFormat>

When a MIME type is super clear and doesn't support multiple subtypes (like for text/csv files), we try to make the formatId and the mediaType fields both use the MIME type as the designator. That doesn't work for polymorphic mime types like text/xml. We choose the formatName to be a human readable and descriptive string that includes version info when appropriate. The formatId is generally set to the established mime type or namespace for the format where one has previously been established (e.g., in a specification). Otherwise, we try to use a URI pointing at the specification for the format, or at another descriptive location for it. Finally, formatType is one of METADATA, DATA, or RESOURCE, which is used within DataONE to determine harvesting and indexing workflows for that type of file, and mostly can be ignored by other groups (although it is useful in DataONE SOLR queries too).
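
To make the formatId versus mediaType distinction concrete, here is a minimal Python sketch that groups the formatIds from the snippet above by their media type. It rests on a few assumptions: the `<objectFormatList>` wrapper is added here only so the fragment parses as one document, the function name `format_ids_by_media_type` is made up for this example, and the live service output may use XML namespaces that this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Entries copied from the snippet above; the wrapper element is an
# assumption added only so the fragment is a single well-formed document.
SNIPPET = """
<objectFormatList>
    <objectFormat>
        <formatId>image/tiff</formatId>
        <formatName>Tagged Image File Format</formatName>
        <formatType>DATA</formatType>
        <mediaType name="image/tiff"/>
        <extension>tiff</extension>
    </objectFormat>
    <objectFormat>
        <formatId>http://rs.tdwg.org/dwc/xsd/simpledarwincore/</formatId>
        <formatName>Simple Darwin Core</formatName>
        <formatType>METADATA</formatType>
        <mediaType name="text/xml"/>
        <extension>xml</extension>
    </objectFormat>
    <objectFormat>
        <formatId>http://digir.net/schema/conceptual/darwin/2003/1.0/darwin2.xsd</formatId>
        <formatName>Darwin Core, version 2.0</formatName>
        <formatType>METADATA</formatType>
        <mediaType name="text/xml"/>
        <extension>xml</extension>
    </objectFormat>
</objectFormatList>
"""


def format_ids_by_media_type(xml_text):
    """Group formatId values by the name attribute of their mediaType."""
    root = ET.fromstring(xml_text.strip())
    grouped = defaultdict(list)
    for fmt in root.iter("objectFormat"):
        media = fmt.find("mediaType")
        if media is None:
            continue  # skip entries that carry no media type
        grouped[media.get("name")].append(fmt.findtext("formatId"))
    return dict(grouped)
```

Calling `format_ids_by_media_type(SNIPPET)` maps text/xml to both Darwin Core formatIds, while image/tiff maps only to itself, which is the granularity difference described above.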

Aug 20 '20 21:08 mbjones

@mbjones So if an LTER IM wanted to contribute, all we'd need is a DataONE account, and then we'd see the option to post a new issue, right? (I didn't see a way to create an account. Also, it looks like you have 702 issues in Redmine!)

What can an IM expect as far as timing and process when submitting a new or edited data type? An example of what I'm talking about is the set of rules for changes to the CF Conventions, in which each proposal gets a moderator, a discussion period, and has to have approval of three folks on the committee, etc.

I like that this service comes from DataONE. Before I recommend to our non-tabular working group that we go with this, I want to make sure there's a good pathway for us to contribute, and that we won't get lost in the shuffle of 702 issues. We could have a few dozen additional formats we need as well (I haven't formally compared your list to what LTER sites actually use yet), so we'd want to get them added efficiently.

There are EML implications as well, since what I believe this implies is that there would be no controlled vocabulary for formatName in EML, but rather the user would be encouraged to grab a formatName from DataONE.

Aug 21 '20 20:08 twhiteaker

hi @twhiteaker as mentioned, we are partway through migrating from Redmine to GitHub. So, I would prefer to not add new people and processes to Redmine. I will talk to Dave about migrating this request process to GitHub as well. Generally there is little to be debated on these proposals -- all we typically look for is to ensure that 1) the format is not already registered, and 2) the proposed formatId is the most sensible it can be given the format specification. New IDs can generally be incorporated quickly, within hours or a few days. Up until now we've reviewed these within the cyberinfrastructure team, but are open to broader review as well, as long as the process doesn't become overly burdensome. Ultimately, it's just a controlled list of formats, and a little research will indicate if a proposed addition is sensible. The CF conventions are far more subjective, and need that more elaborate governing process.

It would be really nice if EML allowed a reference to the controlled URI for the formatId, but that would require a schema change. Possible, but not sure if it merits the disruption. Let's discuss and get further input. You could use the annotation field as well to record the controlled term for the entity.

Aug 21 '20 22:08 mbjones

OK, I think this will work for us then. I'm going to suggest to the non-tabular working group that we use, and contribute to, DataONE's extensible formats service as a controlled set of formatNames for use in EML.

Is there an open issue about migrating the request process to GitHub? Really, I'm just looking to "subscribe" so that I can be aware once that process is completed; it doesn't matter how I'm kept in the loop. Then we IMs can start suggesting new terms.

Aug 24 '20 15:08 twhiteaker

@twhiteaker Great! We migrated the format list to GitHub here:

Format repository: https://github.com/DataONEorg/object-formats
Example request for shapefile format: https://github.com/DataONEorg/object-formats/issues/3

I'm still working on describing the process, but it will be updated in the README when completed. Suggestions appreciated.

Aug 24 '20 22:08 mbjones