droid
CSV Header does not contain column headers for multiple format matches
The CSV Writer repeats the last 4 columns ("PUID","MIME_TYPE","FORMAT_NAME","FORMAT_VERSION") when multiple matches are made against a file. https://github.com/digital-preservation/droid/blob/66ca5162bcd09c39dce12379d7c49aea835e960c/droid-export/src/main/java/uk/gov/nationalarchives/droid/export/CsvItemWriter.java#L144
However, the CSV header does not document what these headers are, or that they exist. CSV parsers that infer the number of columns for the entire CSV based on the header report errors when they encounter a row with more than 18 columns.
Relatedly, any given row in a DROID CSV will have 14 + 4x fields, where x is the number of matches for that particular file. All rows should have the same number of fields in order to simplify parsing.
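As a consumer-side workaround for the ragged rows described above, here is a minimal sketch that pads an already-exported CSV to a uniform width. It assumes the last four header names are the repeating group, as stated in this issue. The function names `pad_ragged_rows` and `pad_ragged_csv` and the `_2`, `_3` suffix naming are hypothetical choices for illustration, not anything DROID itself produces.

```python
import csv
import io


def pad_ragged_rows(rows, group=4):
    """Return rows padded to one common width.

    rows[0] is the header. Its last `group` names are assumed to be the
    repeating identification columns (PUID, MIME_TYPE, ...).
    """
    header, data = rows[0], rows[1:]
    width = max([len(header)] + [len(r) for r in data])
    extra = width - len(header)
    if extra % group:
        raise ValueError("extra columns are not a multiple of the group size")

    repeat_names = header[-group:]
    new_header = list(header)
    # Invent names for the undocumented columns: PUID_2, MIME_TYPE_2, ...
    # (hypothetical naming, chosen only so that every column has a header).
    for n in range(2, 2 + extra // group):
        new_header += [f"{name}_{n}" for name in repeat_names]

    # Pad shorter rows with empty strings so every row matches the header.
    padded = [list(r) + [""] * (width - len(r)) for r in data]
    return [new_header] + padded


def pad_ragged_csv(text, group=4):
    """Same as pad_ragged_rows, but from CSV text to CSV text."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]  # skip blank lines
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(pad_ragged_rows(rows, group))
    return out.getvalue()
```

The padded width depends on the widest row in that particular file, so two exports can still end up with different column counts.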
A note on APIs and versioning here:
For those (including myself) who have built tooling, some of it in current use, around this known 1.6.x limitation of outputting a single row per file: can I ask that any fixes for this be put into a 1.7.x branch?
Thanks for raising the issue - not intending to change this behaviour specifically at this stage. However, will consider further when reviewing possible changes to profile output in future.
Is there already an idea of when this will be solved? The CSVs are now difficult (and frustrating) to process.
Hi, thanks for raising this again. It's useful to hear when other users are experiencing difficulty with a particular issue. Fixing this involves changing the way the CSV output is formatted, which would have a downstream effect on other tools that currently consume DROID output, so we would need to be careful about how we manage the change. We had intended to look at this as part of potential work for a DROID 7.0 release, although we don't currently have a clear view on when that work might commence (our next release, 6.6, is currently planned for Autumn 2020).
Within PRONOM we try to minimise the potential for clashes that result in a multiple-identification outcome, particularly for those formats that have identification signatures (as opposed to extension-only identification outcomes, which are known to be less reliable). Are you able to provide information about the kinds of clashes you are seeing?
David
Mostly .mov files that are identified as both Apple ProRes and video/quicktime. I want to import the CSVs into a MySQL database to do some analysis, but that's difficult when the CSV doesn't have a fixed amount of columns.
that's difficult when the CSV doesn't have a fixed amount of columns
For that reason I always tend to choose the option to have multiple identifications shown as separate entries in the export, rather than as multiple IDs on a single line. Even if we added headers so that every row in a given export had a consistent number of fields, you could still find that one CSV file has two sets of ID columns while in another a file actually gets three IDs. It's practically impossible to guarantee that all CSV exports would have the same number of fields once you include multiple IDs on a single line.
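For exports that were already written with multiple IDs on a single line, the same "separate entries" shape can be approximated after the fact. The following is a rough sketch, not DROID's own implementation: it assumes 14 leading per-file fields followed by groups of 4 identification fields, as described earlier in this issue, and the function name `one_row_per_id` is hypothetical.

```python
def one_row_per_id(rows, n_fixed=14, group=4):
    """Split each multi-identification row into one row per identification.

    rows[0] is the header. Every output row has exactly n_fixed + group
    fields, which suits fixed-schema targets such as a database table.
    """
    header, data = rows[0], rows[1:]
    out = [list(header[: n_fixed + group])]
    for r in data:
        if not r:
            continue  # ignore blank lines
        fixed, rest = list(r[:n_fixed]), list(r[n_fixed:])
        if len(rest) % group:
            raise ValueError(f"unexpected field count in row: {r!r}")
        if not rest:
            # No identification fields at all: keep the file, pad with blanks.
            out.append(fixed + [""] * group)
        # Copy the per-file fields verbatim in front of each identification.
        for i in range(0, len(rest), group):
            out.append(fixed + rest[i : i + group])
    return out
```

The input `rows` can come from `csv.reader`, and the result can be written back with `csv.writer`. Because the per-file fields are duplicated, any analysis that counts files rather than identifications needs to deduplicate on a per-file key.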
thanks for the tip!
Fixed under a different issue number