metacat
metacat sets filename incorrectly during download via getObject
In theory we already fixed this issue of file naming for downloads from Metacat in the closed ticket #1174, but we got a verifiable case of it today on the KNB (see RT ticket 25950), so we may not have fully fixed the problem. In that RT ticket, the user reported downloading two data files whose filenames did not correspond to their names in the metadata, which prevented them from knowing how to open the files. The two files were from this dataset: https://doi.org/10.5063/F1BG2KX8
The first file has an EML objectName of "SNAPP_SHAPES_METADATA.xlsx", but when downloaded via the getObject service it is given the name "snapp_computing5.1-DATA.data" on disk. The file has a type of application/octet-stream, and no fileName is set in the system metadata for file 1.
The second file has an EML objectName of "SNAPP_Amazon_Aquatic_Ecosystem_Spatial_Framework.gdb.zip", but when downloaded via the getObject service it is given the name "snapp_computing6.1-DATA.data" on disk. It also has a type of application/octet-stream, and no fileName is set in the system metadata for file 2.
I suspect the problem is that, when the SystemMetadata fileName field is not set, the objectName from EML is not used to name the file in the web response. I vaguely remember us discussing this issue in the past, and that it may be related to how hard it is to get all of the needed names into the Solr index.
I think the solution is to use our multiple sources of truth for filenames, where we prefer the SystemMetadata fileName, then fall back to the EML objectName (or equivalent field from ISO), and then maybe even fall back to EML entityName, although that one is a big stretch and may not be appropriate.
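To make the proposed preference order concrete, here is a minimal sketch of it. This is not Metacat code: the function name and its parameters are hypothetical, and it assumes the candidate names have already been extracted from the system metadata and the EML, which is the hard part.

```python
from typing import Optional


def preferred_download_name(
    sysmeta_file_name: Optional[str],
    eml_object_name: Optional[str],
    eml_entity_name: Optional[str],
    generated_name: str,
) -> str:
    """Return the first usable name, in the order proposed above.

    All parameter names are hypothetical; `generated_name` stands in for
    whatever name the server would generate today as a last resort.
    """
    candidates = (sysmeta_file_name, eml_object_name, eml_entity_name)
    for candidate in candidates:
        # Treat missing and whitespace-only values as "not set".
        if candidate and candidate.strip():
            return candidate.strip()
    return generated_name
```

For the first file in this report, which has no sysmeta fileName, this order would yield "SNAPP_SHAPES_METADATA.xlsx" rather than the generated name.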
@mbjones Currently Metacat relies heavily on the fileName field in the system metadata to determine the name used for a download. If that field is not set, it falls back to looking up the file extension registered for the format type. If it still can't find a file extension, it just appends .data to the identifier. The quote from the previous ticket is: to use @systemmetadata.fileName@ when it's available, and fall back to generating a file name. The ticket is linked below:
https://github.com/NCEAS/metacat/issues/1174
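As a rough illustration of the current behavior described in this comment, the naming logic amounts to something like the sketch below. It is not the actual Metacat implementation: the function name, the extension table, and the identifier used in the example are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical stand-in for the registry that maps format ids to extensions.
# The real mapping used by Metacat is not shown in this thread.
EXTENSION_BY_FORMAT = {
    "text/csv": "csv",
    "application/pdf": "pdf",
}


def current_download_name(
    identifier: str, file_name: Optional[str], format_id: str
) -> str:
    """Approximate the naming rules described above (assumed, not verified)."""
    # 1. Use the system metadata fileName when it is set.
    if file_name:
        return file_name
    # 2. Otherwise, use the extension registered for the format type.
    extension = EXTENSION_BY_FORMAT.get(format_id)
    if extension:
        return f"{identifier}.{extension}"
    # 3. Otherwise, append ".data" to the identifier.
    return f"{identifier}.data"
```

With no fileName and a format of application/octet-stream, which has no useful extension, the third branch is taken. That matches the ".data" names reported for the two files.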
I can tell the data package was originally generated by Morpho and it didn't set a good format id. So the downloaded file doesn't have the correct file extension.
In our previous ticket, we didn't require Metacat to look back at the metadata file. The download action calls the getObject method, and the only parameter is the identifier of the object. To link to the metadata, Metacat has to query Solr to find the associated metadata file. Since the Solr docs don't have the entity information, it then has to parse the metadata object to get it. And because Metacat supports so many metadata standards, that is not an easy job. I think this is the reason we decided to get the file name solely from the system metadata.
Maybe a compromise here would be to manually fix the sysmeta fileName for all EML documents generated by Morpho if they have a reasonable objectName set. That would fix downloading for files that have reasonable names in the EML, punt on the ones that don't, and rely on sysmeta going forward, given that we no longer support Morpho for uploads. This would likely be a one-time, behind-the-scenes update to fix problems (and would not affect versioning because it's a sysmeta update). Thoughts @taojing2002 ?
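A one-time fix like this could start with a dry run that only lists the planned changes. The sketch below assumes the identifier, sysmeta fileName, and EML objectName have already been collected for each data object. The record type, both function names, and the rule for what counts as a "reasonable" objectName are hypothetical. Nothing here calls Metacat or updates any system metadata.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class DataObject(NamedTuple):
    """Hypothetical record of what is known about one data object."""
    identifier: str
    sysmeta_file_name: Optional[str]
    eml_object_name: Optional[str]


def is_reasonable(name: Optional[str]) -> bool:
    """Assumed rule: non-blank, no path separators, and has an extension."""
    if not name or not name.strip():
        return False
    name = name.strip()
    if "/" in name or "\\" in name:
        return False
    stem, dot, extension = name.rpartition(".")
    return bool(stem and dot and extension)


def plan_file_name_fixes(objects: Iterable[DataObject]) -> Dict[str, str]:
    """Map identifier -> proposed fileName, only where sysmeta has none."""
    return {
        obj.identifier: obj.eml_object_name.strip()
        for obj in objects
        if not obj.sysmeta_file_name and is_reasonable(obj.eml_object_name)
    }
```

Objects that already have a fileName, or whose objectName fails the check, are left out of the plan. That is the "punt" case described above.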