droid CSV Header does not contain column headers for multiple format matches

The CSV Writer repeats the last 4 columns ("PUID","MIME_TYPE","FORMAT_NAME","FORMAT_VERSION") when multiple matches are made against a file. https://github.com/digital-preservation/droid/blob/66ca5162bcd09c39dce12379d7c49aea835e960c/droid-export/src/main/java/uk/gov/nationalarchives/droid/export/CsvItemWriter.java#L144

However, the CSV header does not name these extra columns or indicate that they exist. CSV parsers that infer the number of columns for the entire CSV from the header report errors when they encounter a row with more than 18 columns.

Relatedly, any given row in a DROID CSV will have 14 + 4x fields, where x is the number of matches for that particular file. All rows should have the same number of fields in order to simplify parsing.
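As a possible workaround on the consuming side, the ragged rows can be padded to a common width before parsing. The following is a minimal sketch, not DROID code: the function name `normalize_rows` and the `_2`, `_3` suffixes for the repeated columns are assumptions made for illustration, and it assumes the four identification columns are the last four in the header, as described above.

```python
import csv

ID_COLS = 4  # PUID, MIME_TYPE, FORMAT_NAME, FORMAT_VERSION (per the description above)


def normalize_rows(rows):
    """Pad ragged rows to one width and extend the header to match.

    `rows` is a list of lists, header first (e.g. list(csv.reader(f))).
    The suffixed column names (PUID_2, ...) are an illustrative convention,
    not something DROID itself emits.
    """
    rows = [list(r) for r in rows]
    header, data = rows[0], rows[1:]
    width = max(len(r) for r in rows)

    # Reuse the names of the last four header columns for each extra group.
    id_names = header[-ID_COLS:]
    full_header = list(header)
    for i in range(width - len(header)):
        group = i // ID_COLS + 2
        full_header.append(f"{id_names[i % ID_COLS]}_{group}")

    # Rows with fewer matches get empty strings in the unused columns.
    padded = [r + [""] * (width - len(r)) for r in data]
    return full_header, padded


def normalize_file(src_path, dst_path):
    """Read a DROID CSV export and write a fixed-width copy."""
    with open(src_path, newline="", encoding="utf-8") as src:
        header, data = normalize_rows(list(csv.reader(src)))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(data)
```

The width is only consistent within one export; a second export with a different maximum number of matches would still produce a different column count.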

Oct 31 '16 16:10 nkrabben

A note on APIs and versioning here:

For those (including myself) who have built tooling, several pieces of it currently in use, around the known 1.6.x limitation of outputting a single row per file, can I ask that any fixes for this be put into a 1.7.x branch?

Oct 31 '16 21:10 ross-spencer

Thanks for raising the issue - not intending to change this behaviour specifically at this stage. However, will consider further when reviewing possible changes to profile output in future.

Nov 22 '16 11:11 Brian-O-TNA

Is there already an idea of when this will be solved? The CSVs are now difficult (and frustrating) to process.

Jun 23 '20 15:06 nvanderperren

Hi, thanks for raising this again. It's useful to hear when other users are experiencing difficulty with a particular issue. Fixing this involves changing the way the CSV output is formatted, which would have a downstream effect on other tools that currently consume DROID output, so we would need to be careful about how we manage this change. We had intended to look at this as part of potential work for a DROID 7.0 release, although we don't currently have a clear view on when that work might commence (our next release, 6.6, is currently planned for Autumn 2020).

Within PRONOM we try to minimise the potential for clashes that result in a multiple-identification outcome, particularly for those formats that have identification signatures (as opposed to extension-only identification outcomes, which are known to be less reliable). Are you able to provide information about the kinds of clashes you are seeing?

David

Jun 23 '20 16:06 Dclipsham

Mostly .mov files that are identified as Apple ProRes and video/quicktime. I want to import the CSVs into a MySQL database to do some analysis, but that's difficult when the CSV doesn't have a fixed number of columns.

Jun 24 '20 15:06 nvanderperren

that's difficult when the CSV doesn't have a fixed number of columns

For that reason I always tend to choose the option to have multiple identifications shown as separate entries in the export, rather than as multiple IDs on a single line. Even if we added headers so that every row in a given CSV had a consistent number of fields, you could still find that one CSV file has two sets of ID columns while another, in which a file actually gets three IDs, has three. It's practically impossible to guarantee that all CSV exports would have the same number of fields once you include multiple IDs on a single line.
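For exports that have already been written with multiple IDs on one line, the same one-entry-per-identification shape can be approximated afterwards. This is only a sketch under assumptions: `split_identifications` is a hypothetical helper, and the defaults of 14 leading fields and 4 fields per identification are taken from the issue description rather than from the DROID source.

```python
def split_identifications(row, base_cols=14, id_cols=4):
    """Turn one multi-ID row into one fixed-width row per identification.

    base_cols and id_cols follow the "14 + 4x" layout described in this
    issue; adjust them if an export uses a different column set.
    """
    base, ids = list(row[:base_cols]), list(row[base_cols:])
    if not ids:
        # No identification at all: keep the row, with empty ID fields.
        return [base + [""] * id_cols]

    out = []
    for start in range(0, len(ids), id_cols):
        group = ids[start:start + id_cols]
        group += [""] * (id_cols - len(group))  # tolerate a short final group
        out.append(base + group)
    return out
```

Every output row then has `base_cols + id_cols` fields, which is the property a fixed-schema database import needs, at the cost of repeating the file-level fields for each identification.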

Jun 25 '20 10:06 DavidUnderdown

thanks for the tip!

Jun 25 '20 14:06 nvanderperren

Fixed under a different issue number

Dec 04 '23 11:12 sparkhi
