zotero-bits
Data Set
There have been requests for a "Data Set" item type, but the idea is still rough.
https://github.com/ajlyon/zotero-bits/wiki/DatasetType
Related CSL schema ticket: https://github.com/citation-style-language/schema/issues/74
What about the archival collection type? Here are the related discussions and tickets:
https://www.zotero.org/trac/ticket/661 https://www.zotero.org/trac/ticket/1023 http://forums.zotero.org/discussion/2981/styling-archival-material/ http://forums.zotero.org/discussion/391/1/hierarchical-item-relationships/#Item_48
These are now included in library catalogs and can be imported into Zotero. Supporting them would be a great aid in research: archives need to be cited in bibliographies, and this would need to be implemented in hierarchical item types anyway.
Elena: Are you suggesting that archival collections and data sets could share a new type? I see some conceptual connection between the two, but I think they differ enough in presentation (data sets are often more like articles in citation, right?) that they'll need to be treated differently.
Could you draft a proposal for archival collections as a type and create a new issue for them?
The RIS translator has the comment "// TODO: DATA, MUSIC". That is, we would like to support datasets for importing from RIS. Maybe someone can describe what EndNote puts in a dataset item?
Dataset should probably have a type/genre/medium field - some styles like APSA call for a label such as "computer file" etc. We also need to think about the distinction between producer and distributor, see e.g. here: http://www.lib.ncsu.edu/data/citingdatasets.html I think archive could probably be used for the distributor - in that case all the archive fields should be present.
One issue with datasets is that while they can, in general, be accommodated within other item types, this is often not done consistently - I think I noted somewhere on the forums that they're sometimes treated like articles, sometimes like monographs and sometimes like a third, hybrid category (e.g. neither italics nor quotation marks).
DataCite has done some work on the metadata for datasets: http://www.cdlib.org/cdlinfo/2011/01/24/datacite-metadata-scheme-is-published/. Their required fields are Identifier, Creator, Title, Publisher, PublicationYear.
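As a rough illustration of what "required fields" could mean in practice, here is a minimal Python sketch that checks a record against the five required properties listed above. The record layout (a flat dict keyed by property name) and the example values are assumptions made for illustration; they are not taken from the DataCite schema itself.

```python
# Required properties as listed in the comment above.
DATACITE_REQUIRED = ("Identifier", "Creator", "Title", "Publisher", "PublicationYear")


def missing_required(record: dict) -> list:
    """Return the required property names that are absent or empty in `record`."""
    return [name for name in DATACITE_REQUIRED if not record.get(name)]


# Hypothetical record, invented for illustration only.
example = {
    "Identifier": "10.1234/example",
    "Creator": "Doe, Jane",
    "Title": "Example Survey Data",
    "PublicationYear": "2011",
}

print(missing_required(example))  # ['Publisher']
```

A dataset item type that cannot hold all five of these would lose information on import, which is one way to test a proposed field list.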
That looks like a great set of fields (including the optional ones). Any objections to taking this as our model? Will it cover sufficiently broad use cases?
In general yes - as I note above, we do need a distributor field in addition to the publisher field for some styles (e.g. ICQMR) - the archive field works for that, but we do need to make sure it's included.
Another useful data citation resource was published by the Digital Curation Centre earlier this week: http://www.dcc.ac.uk/resources/how-guides/cite-datasets. The guide also includes a section on elements of a data citation: author, publication date, title, edition, version, feature name and URI, resource type, publisher, unique numeric footprint, identifier, location. The most important ones are author, title, date, location.
I think the DataCite metadata description should be the model, for a variety of reasons - especially because they already have more than one million datasets in their database and people will be citing them.
Any thoughts on what we'll need from CSL to make this work?
I just noticed that datacite.org has started to offer citeproc JSON, but they use (the invalid) "misc" as item type, since CSL doesn't have a suitable item-type. See http://datacite.org/node/63
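To make the "misc" problem concrete, here is a small Python sketch of how a consumer might retype such an item, assuming a "dataset" item type were available in CSL as proposed in this thread. The function name and the example item are hypothetical; the item is not real DataCite output.

```python
def retype_dataset(item: dict) -> dict:
    """Return a copy of a CSL-JSON item with type "misc" replaced by "dataset".

    Assumption: every "misc" item passed in is known to describe a dataset,
    e.g. because it came from a data repository.
    """
    fixed = dict(item)  # shallow copy so the caller's item is left untouched
    if fixed.get("type") == "misc":
        fixed["type"] = "dataset"
    return fixed


# Hypothetical CSL-JSON item, invented for illustration only.
item = {
    "type": "misc",
    "title": "Example Survey Data",
    "publisher": "Example Repository",
    "DOI": "10.1234/example",
}

print(retype_dataset(item)["type"])  # dataset
```

This kind of workaround only goes away once the item type exists in the schema and producers emit it directly.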
Ignoring the discussion about which fields are needed to handle datasets, is there already enough of a consensus that adding a "dataset" CSL item type is a good idea? @bdarcus greenlighted it in 2009 (http://forums.zotero.org/discussion/4771/item-types/?Focus=25845#Comment_25845), and I'm in support for it as well.
I would really like to see a dataset CSL item type. Another step to make it easier to add dataset citations to tools handling references.
There's a related discussion at https://github.com/IQSS/dataverse/pull/3828#issuecomment-310395228 and I'd like to thank @adam3smith for testing Zotero with Dataverse! We'll keep an eye on this issue.
I do have use for type: Dataset.
The fields needed are well described in http://www.dlib.org/dlib/january11/starr/01starr.html
"When the DataCite Consortium was founded in 2009, the development of a DataCite metadata scheme was an early priority." The Metadata Working Group did spend a few years creating the scheme, so I think we should just use their suggestion, unless there is something newer.
Social scientists use many datasets that should be cited. Sometimes they are surveys collected in a particular location between two dates (i.e. the source does not change), sometimes they are data from public sources like Eurostat, where new data is added at regular intervals (so the source does change).
This has been implemented in CSL since version 1.0.1. It is currently available in Zotero using a workaround and will be available in Zotero as a regular item type in the future. No need for further explanations.
Is there any info on when it will be included? This issue is 10 years old.
Zotero never does ETAs, but they've been making changes to the data model, so probably not too far out (I'd guess months, but that's just a guess)
@dstillman since you're working on standard already (and it'd make a lot of people in my line of work happy if we got this into Zotero) could I advocate for including this in the next type update:
Here are the proposed fields: Name: Dataset (there is some discussion here with "Data", "Data set," "Dataset" and several other contenders, but I think Dataset makes the most sense).
- Standard fields: Title, Author/Contributor, Date, Language, Short Title, URL, DOI, Accessed, Archive, Loc. in Archive, Library Catalog, Call Number, Rights, Extra, Date Added, Modified
- Identifier (CSL: number) -- while DOIs are very common, many repositories, especially in the life sciences, have their own ID schemas
- Repository and Repository Location (CSL: publisher and publisher-place, respectively). The latter is getting rare, but we still see it in requested citations.
- Version
- Type (CSL: genre)
- Medium (e.g., if older data are on CD-ROM etc.)
I've checked this against the latest iteration of the DataCite Metadata schema and it hits all relevant fields that could possibly be cited. I think it's worth keeping Archive in there for historical data that's not in a repository but in a physical archive.
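The dataset-specific part of the proposal can be sketched as a lookup table from Zotero field to CSL variable. In the Python sketch below, the mappings for Identifier, Repository, Repository Location and Type come from the list above; the Zotero-side key names and the mappings for Version and Medium are assumptions for illustration, not Zotero's actual schema.

```python
# Proposed Zotero field (hypothetical key names) -> CSL variable.
ZOTERO_TO_CSL = {
    "identifier": "number",                   # stated in the proposal
    "repository": "publisher",                # stated in the proposal
    "repositoryLocation": "publisher-place",  # stated in the proposal
    "type": "genre",                          # stated in the proposal
    "version": "version",                     # assumption: same-named CSL variable
    "medium": "medium",                       # assumption: same-named CSL variable
}


def to_csl(fields: dict) -> dict:
    """Convert the dataset-specific fields of an item to a CSL-JSON fragment.

    Unmapped and empty fields are skipped; standard fields such as title
    and author are out of scope for this sketch.
    """
    csl = {"type": "dataset"}
    for name, value in fields.items():
        if name in ZOTERO_TO_CSL and value:
            csl[ZOTERO_TO_CSL[name]] = value
    return csl


# Hypothetical item, invented for illustration only.
print(to_csl({"identifier": "ABC-1", "repository": "Example Repository"}))
# {'type': 'dataset', 'number': 'ABC-1', 'publisher': 'Example Repository'}
```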
Do we want to label it Medium or Format (which is what is currently used for Audio Recording, Film, Video Recording)? I don't have a strong preference for one over the other.
It's Format in DataCite, so if we're using that elsewhere already, let's stick with that.
Great, let's go with Format
@bwiernik Audio and Film should be cited and referenced either as film or audio units or as collections. So Medium for dataset here makes sense to me. Frankly datasets should not be audio or video content. There is a difference between a dataset and a collection. People need to be citing these groups of audio and video artifacts as collections.
The vast majority of people working on data and data-infrastructure would disagree with that (there are data repositories specifically dedicated to video data), but that's also not the point here, so we don't need to solve it.
The reason bwiernik mentioned that 'Format' is used for video-type item types is that it is used for the format in which video content is delivered (e.g., on DVD, CD-ROM, Blu-ray), which is reasonably similar to the types of formats cited for data delivery (where still relevant) such as DVD, CD-ROM, so it makes sense to use the same variable. I honestly think either option would have been fine, but, as I said, aligning this with the DataCite metadata terminology probably makes sense.