File description metadata of ingested files are not in the DDI exported metadata

Open jggautier opened this issue 7 years ago • 20 comments

I've seen this omission/bug since at least Dataverse version 4.9 and verified it in Dataverse version 6.1:

In the exported DDI metadata of datasets with ingested files, the file description text (which depositors enter for uploaded files) is missing for every file that Dataverse ingests. For example, for the dataset at https://doi.org/10.7910/DVN/1ZPAKL, the ingested file's description, "Data from vignette survey experiment conducted in Denmark in June 2023", is not in the exported DDI.
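
A minimal sketch of how the missing description could be checked from outside the application. It assumes the dataset is served from https://dataverse.harvard.edu (an assumption based on the DOI prefix) and uses the native API export endpoint `/api/datasets/export`. The class and method names are hypothetical, and the plain substring match is naive: it would miss a description that contains characters escaped in the XML.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class DdiExportCheckSketch {

    // Builds the DDI export URL for a dataset. serverUrl is an assumption
    // supplied by the caller, e.g. "https://dataverse.harvard.edu".
    static URI ddiExportUri(String serverUrl, String persistentId) {
        String encodedPid = URLEncoder.encode(persistentId, StandardCharsets.UTF_8);
        return URI.create(serverUrl + "/api/datasets/export?exporter=ddi&persistentId=" + encodedPid);
    }

    // Downloads the DDI export and reports whether the given text occurs in it.
    static boolean exportContains(String serverUrl, String persistentId, String text) {
        try {
            HttpRequest request = HttpRequest.newBuilder(ddiExportUri(serverUrl, persistentId))
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 && response.body().contains(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching the export", e);
        }
    }

    public static void main(String[] args) {
        boolean found = exportContains(
                "https://dataverse.harvard.edu",
                "doi:10.7910/DVN/1ZPAKL",
                "Data from vignette survey experiment conducted in Denmark in June 2023");
        System.out.println("Description present in DDI export: " + found);
    }
}
```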

I think next steps would be to figure out if:

  • this is a bug, so the file descriptions of ingested files should be in the DDI, or
  • this is just an omission, and we have to figure out where in the DDI xml the file description should be

jggautier avatar Sep 12 '18 17:09 jggautier

2024/05/06

  • Determine if this is an omission or a bug.

cmbz avatar May 06 '24 18:05 cmbz

2024/07/10

  • Sized at 3 and assigned to @jggautier for assessment. Resize and reassign as needed.

cmbz avatar Jul 10 '24 19:07 cmbz

@amberleahey or @stevenmce might know off the top of their head where a file description should go in DDI.

pdurbin avatar Jul 10 '24 20:07 pdurbin

The spec is here: https://ddialliance.org/Specification/DDI-Codebook/2.5/

Are we updating to DDI version 2.1 or 2.5? (2.6 is also on its way, but has only just been released for review: https://github.com/ddialliance/ddi-c_2.)

stevenmce avatar Jul 11 '24 01:07 stevenmce

Here's a sample XML from the DDI spec site: https://ddialliance.org/sites/default/files/1990STF1_2_5.xml (From a study from Minnesota Population Center)

stevenmce avatar Jul 11 '24 01:07 stevenmce

Thanks for the links @stevenmce.

I think Dataverse is using DDI 2.5 already, right? That's what we say in the Appendix page of the Dataverse Guides and I see references to that version in the DDI exports. And when I opened this GitHub issue, I was referring to how Dataverse uses that version of DDI Codebook.

I think it'll be helpful to add what some folks from the Dataverse core team said about this GitHub issue during a planning meeting this week:

  • We weren't sure if this was a bug and the file descriptions of ingested files were meant to be included in DDI-C exports, or if those file descriptions were intentionally left out of the DDI-C export.
  • @pdurbin asked why this is important. I wondered if one reason is that excluding file descriptions of ingested files might make some data less discoverable. For example, if DDI-C metadata of one Dataverse repository is harvested into another repository, that repository won't be able to index what's in the description metadata of ingested files, which might help others find that dataset. And I recommended reaching out to learn more from other repositories that seem interested in this GitHub issue, like Dataverse SODHA.

With other priorities I'm not able to focus on this issue, so I'm recommending we move it out of the sprint ready column of the IQSS Dataverse Project board. @sbarbosadataverse, do you agree?

jggautier avatar Jul 12 '24 16:07 jggautier

I think exposing file descriptions via DDI is a great idea. I took a quick look at the links above but I wasn't able to quickly figure out which DDI field to use. 🤷

pdurbin avatar Jul 15 '24 14:07 pdurbin

A few things are happening for file metadata and DDI Codebook exports:

  1. Only ingested tabular files are added to the File Description <fileDscr> DDI tag set, and all other files in the Dataverse dataset are added to the <otherMat> (Other Materials) DDI tag. See this example: https://odesi.ca/api/ddi?id=/odesi/doi__10-5683_SP3_LDJZ8Y.xml
  2. For ingested tabular files, the descriptions are not included in the <fileDscr> section. For non-tabular files, which are referenced in <otherMat>, the descriptions are included and mapped into DDI <otherMat>, e.g. "Command code - STATA format" (see full dataset in Borealis and tabular file with description not included in DDI exported XML here).
  3. It's interesting that for <otherMat> files, Dataverse autogenerates a note with the MIME type (e.g. "text/x-spss-syntax").

Overall, I think the tabular data ingested files could remain in the File Dscr section and we add a TXT or NOTE tag to the set for the descriptions. We also noticed issues with mapping the new standard CC licenses: these do not get into the DDI, but custom licenses do, so we had to set this up for all of Odesi. There are other mapping issues with Codebook that could be tackled by the DDI community, and a new exporter could be built to support 2.5 and 2.6 with these improved mappings.
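
A minimal sketch of how the descriptions that do reach the export could be listed. It reads each <otherMat> element and pairs its <labl> with its <txt>. The class and method names are hypothetical, and the parsing is not namespace-aware, which assumes the export uses unprefixed element names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class OtherMatDescriptionsSketch {

    // Maps each <otherMat> file label (<labl>) to its description (<txt>).
    // A file without a <txt> child maps to an empty string.
    static Map<String, String> otherMatDescriptions(String ddiXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(ddiXml)));
            NodeList materials = doc.getElementsByTagName("otherMat");
            Map<String, String> descriptions = new LinkedHashMap<>();
            for (int i = 0; i < materials.getLength(); i++) {
                Element material = (Element) materials.item(i);
                descriptions.put(firstText(material, "labl"), firstText(material, "txt"));
            }
            return descriptions;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the DDI XML", e);
        }
    }

    // Returns the trimmed text of the first descendant with the given tag name.
    private static String firstText(Element parent, String tagName) {
        NodeList found = parent.getElementsByTagName(tagName);
        return found.getLength() == 0 ? "" : found.item(0).getTextContent().trim();
    }
}
```

Running the same loop over <fileDscr> elements would show the gap described in point 2: there is no description child to read for ingested tabular files.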

amberleahey avatar Jul 15 '24 17:07 amberleahey

@amberleahey thanks, that helped me find the writeFileDescription method that does indeed write to the DDI txt field, like you're saying, such as <txt>Command code - STATA format</txt> below.

<otherMat ID="f663995" URI="https://borealisdata.ca/api/access/datafile/663995" level="datafile" restricted="false">
    <labl>CTNS2022_P_BSW.dct</labl>
    <txt>Command code - STATA format</txt>
    <notes level="file" subject="Content/MIME Type" type="DATAVERSE:CONTENTTYPE">application/octet-stream</notes>
</otherMat>

private static void writeFileDescription(XMLStreamWriter xmlw, FileDTO fileDTo) throws XMLStreamException {
    xmlw.writeStartElement("txt");
    String description = fileDTo.getDataFile().getDescription();
    if (description != null) {
        xmlw.writeCharacters(description);
    }
    xmlw.writeEndElement(); // txt
}

pdurbin avatar Jul 15 '24 18:07 pdurbin

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

cmbz avatar Aug 20 '24 15:08 cmbz

2024/08/23: Reopening because issue was already sized and prioritized.

cmbz avatar Aug 23 '24 20:08 cmbz

Just a quick note before I make a PR:

> Overall, I think the tabular data ingested files could remain in the File Dscr section and we add a TXT or NOTE tag to the set for the descriptions.

Tabular ("ingested") files do need to remain in fileDscr sections - that's required by the schema essentially, since fileDscr provides the dedicated fields that encode information specific to tabular data (such as dimensns, caseQnty and varQnty). We cannot add a txt field for the description text there, like we do with otherMat, because it's not in the schema. But a note with an appropriate attribute seems like a good solution - and yes, we should have handled it like that all along.

I'm seeing that this was estimated as a "3", which is what we use for most straightforward fixes - like the amount of effort it would take to implement what I just described above, so I'll try and stay within that. :)

landreev avatar Oct 17 '24 13:10 landreev

So, it'll look like this:

<notes level="file" type="DATAVERSE:FILEDESC" subject="DataFile Description">
   This is a tabular file produced from a Stata .dta file with rich descriptive metadata
</notes>
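
A minimal sketch of how such a note could be written with the same XMLStreamWriter API that writeFileDescription uses. This is not the code from the pull request: the class and both method names are hypothetical, and skipping blank descriptions is an assumption about the desired behavior.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class FileDescriptionNoteSketch {

    // Hypothetical helper: writes the description as a <notes> element with
    // the attributes shown above, and writes nothing when there is no
    // description to export.
    static void writeFileDescriptionNote(XMLStreamWriter xmlw, String description)
            throws XMLStreamException {
        if (description == null || description.isBlank()) {
            return;
        }
        xmlw.writeStartElement("notes");
        xmlw.writeAttribute("level", "file");
        xmlw.writeAttribute("type", "DATAVERSE:FILEDESC");
        xmlw.writeAttribute("subject", "DataFile Description");
        xmlw.writeCharacters(description); // special characters are escaped by the writer
        xmlw.writeEndElement(); // notes
    }

    // Convenience wrapper for trying the helper out: renders the note to a string.
    static String renderNote(String description) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writeFileDescriptionNote(xmlw, description);
            xmlw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write the note", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(renderNote(
                "This is a tabular file produced from a Stata .dta file with rich descriptive metadata"));
    }
}
```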

landreev avatar Oct 18 '24 17:10 landreev

Thanks @landreev!

I opened this GitHub issue and merely described something that seemed inconsistent to me. But I think I should have also encouraged us to think about how we'll know that however this is resolved was a good way to resolve it. And I hope that we can discuss this now while considering your solution.

I imagine this would help anyone who needs to export the DDI-Codebook metadata of data in their repository in order to preserve that metadata. Does that sound right?

This change has no effect on how findable harvested datasets are, since I think Dataverse doesn't index any of the file-level metadata that it harvests from DDI-Codebook metadata.

jggautier avatar Oct 18 '24 18:10 jggautier

This change may potentially affect our Data Explorer and our other tool (odesi). We will need to test that.

lubitchv avatar Oct 18 '24 18:10 lubitchv

> This change may potentially affect our Data Explorer and our other tool (odesi). We will need to test that.

I'll test it with your Data Explorer also. I can't imagine it actually causing a problem: since the new note has attributes clearly marking it as different from the other kinds of notes that can be found under <fileDscr>, I expect it to just be skipped. But yes, it needs to be tested, of course.
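
A minimal sketch of why a consumer that selects notes by their type attribute would not be affected. It is not code from Data Explorer or odesi: the class and method names are hypothetical, and how those tools actually read notes is an assumption that still needs the testing mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NoteTypeFilterSketch {

    // Returns the text of every <notes> element whose type attribute equals
    // wantedType. Notes of any other type, including a newly added one,
    // are skipped.
    static List<String> notesOfType(String xml, String wantedType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList notes = doc.getElementsByTagName("notes");
            List<String> matches = new ArrayList<>();
            for (int i = 0; i < notes.getLength(); i++) {
                Element note = (Element) notes.item(i);
                if (wantedType.equals(note.getAttribute("type"))) {
                    matches.add(note.getTextContent().trim());
                }
            }
            return matches;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the DDI fragment", e);
        }
    }
}
```

A tool that instead reads every note under <fileDscr> regardless of type would pick up the new text, which is the case worth checking.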

Was good to see you at Dagstuhl! 🙂

landreev avatar Oct 18 '24 20:10 landreev

> This change may potentially affect our Data Explorer and our other tool (odesi). We will need to test that.

FWIW, the test in EditDDIIT is passing and EditDDI does not appear to be using <fileDscr>.

landreev avatar Oct 18 '24 20:10 landreev

@jggautier Yeah, it was just a weird inconsistency. Was worth fixing just for the sake of striving to export as much of the information about the data as possible. Whether it'll ever benefit anyone significantly in real life, idk. As I mentioned in the PR, I'm guessing we haven't been exporting it for ingested files because there was no obvious place for it under <fileDscr> in the schema; but we should have used another free text note for it all along.

landreev avatar Oct 18 '24 20:10 landreev

> This change may potentially affect our Data Explorer and our other tool (odesi). We will need to test that.

> FWIW, the test in EditDDIIT is passing and EditDDI does not appear to be using <fileDscr>.

Right, I was thinking of <dataDscr>, which uses additional note sections for curation. So yes, you are right: for data curation it should not matter, since it does not use <fileDscr>. However, I should talk to my colleague @nana-boateng, who uses the XML codebook for our search tool, odesi. I believe it should not matter there either, but we need to test it too.

lubitchv avatar Oct 18 '24 20:10 lubitchv

@nana-boateng confirms that the change should not affect our odesi tool.

lubitchv avatar Oct 24 '24 14:10 lubitchv