Data Set

Open rmzelle opened this issue 14 years ago • 22 comments

There have been requests for a "Data Set" item type, but the idea is still rough.

https://github.com/ajlyon/zotero-bits/wiki/DatasetType

Related CSL schema ticket: https://github.com/citation-style-language/schema/issues/74

rmzelle avatar Jan 19 '11 01:01 rmzelle

What about the archival collection type? Here are the related discussions and tickets:

https://www.zotero.org/trac/ticket/661 https://www.zotero.org/trac/ticket/1023 http://forums.zotero.org/discussion/2981/styling-archival-material/ http://forums.zotero.org/discussion/391/1/hierarchical-item-relationships/#Item_48

These are now included in library catalogs and can be imported into Zotero. Supporting them would be a great aid to research: archives need to be cited in bibliographies, and this would need to be implemented for hierarchical item types anyway.

erazlogo avatar Jan 25 '11 16:01 erazlogo

Elena: Are you suggesting that archival collections and data sets could share a new type? I see some conceptual connection between the two, but I think they differ enough in presentation (data sets are often more like articles in citation, right?) that they'll need to be treated differently.

Could you draft a proposal for archival collections as a type and create a new issue for them?

avram avatar Jan 25 '11 17:01 avram

The RIS translator has the comment "// TODO: DATA, MUSIC". That is, we would like to support datasets for importing from RIS. Maybe someone can describe what EndNote puts in a dataset item?

avram avatar Mar 23 '11 08:03 avram

Dataset should probably have a type/genre/medium field - some styles like APSA call for a label such as "computer file" etc. We also need to think about the distinction between producer and distributor, see e.g. here: http://www.lib.ncsu.edu/data/citingdatasets.html. I think archive could probably be used for the distributor - in that case all the archive fields should be present.

One issue with datasets is that while they can, in general, be accommodated within other item types, this is often not done consistently - I think I noted somewhere on the forums that they're sometimes treated like articles, sometimes like monographs and sometimes like a third, hybrid category (e.g. neither italics nor quotation marks).

adam3smith avatar Jun 11 '11 03:06 adam3smith

DataCite has done some work on the metadata for datasets: http://www.cdlib.org/cdlinfo/2011/01/24/datacite-metadata-scheme-is-published/. Their required fields are Identifier, Creator, Title, Publisher, PublicationYear.
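The required-field list above can be turned into a simple completeness check. This is only a minimal sketch: the five property names come from the comment above, while the function name `missing_required_fields` and the sample record are hypothetical and invented for illustration.

```python
# Required DataCite properties, as listed in the comment above.
REQUIRED_FIELDS = ("Identifier", "Creator", "Title", "Publisher", "PublicationYear")


def missing_required_fields(record: dict) -> list:
    """Return the required property names that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record, invented for illustration only (not a real dataset).
example = {
    "Identifier": "10.1234/example-doi",
    "Creator": "Doe, Jane",
    "Title": "Example Survey Data",
    "PublicationYear": "2011",
}

print(missing_required_fields(example))  # ['Publisher']
```

A check like this would only tell an importer whether the minimum citation elements are present; it says nothing about the optional properties.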

mfenner avatar Oct 18 '11 07:10 mfenner

That looks like a great set of fields (including the optional ones). Any objections to taking this as our model? Will it cover sufficiently broad use cases?

avram avatar Oct 18 '11 20:10 avram

In general, yes. As I noted above, we do need a distributor field in addition to the publisher field for some styles (e.g. ICQMR). The archive field works for that, but we do need to make sure it's included.

adam3smith avatar Oct 18 '11 20:10 adam3smith

Another useful data citation resource was published by the Digital Curation Centre earlier this week: http://www.dcc.ac.uk/resources/how-guides/cite-datasets. The guide also includes a section on elements of a data citation: author, publication date, title, edition, version, feature name and URI, resource type, publisher, unique numeric footprint, identifier, location. The most important ones are author, title, date, location.

mfenner avatar Oct 19 '11 05:10 mfenner

I think the DataCite metadata description should be the model, for a variety of reasons - especially because they already have more than one million datasets in their database and people will be citing them.

mfenner avatar Oct 19 '11 05:10 mfenner

Any thoughts on what we'll need from CSL to make this work?

avram avatar Oct 19 '11 05:10 avram

I just noticed that datacite.org has started to offer citeproc JSON, but they use (the invalid) "misc" as item type, since CSL doesn't have a suitable item-type. See http://datacite.org/node/63

rmzelle avatar May 17 '12 17:05 rmzelle

Ignoring the discussion about which fields are needed to handle datasets, is there already enough of a consensus that adding a "dataset" CSL item type is a good idea? @bdarcus greenlighted it in 2009 (http://forums.zotero.org/discussion/4771/item-types/?Focus=25845#Comment_25845), and I'm in support of it as well.

rmzelle avatar May 17 '12 18:05 rmzelle

I would really like to see a dataset CSL item type. Another step to make it easier to add dataset citations to tools handling references.

mfenner avatar May 17 '12 18:05 mfenner

There's a related discussion at https://github.com/IQSS/dataverse/pull/3828#issuecomment-310395228 and I'd like to thank @adam3smith for testing Zotero with Dataverse! We'll keep an eye on this issue.

pdurbin avatar Jun 22 '17 16:06 pdurbin

I do have use for type: Dataset.

The fields needed are well described in http://www.dlib.org/dlib/january11/starr/01starr.html

"When the DataCite Consortium was founded in 2009, the development of a DataCite metadata scheme was an early priority." The Metadata Working Group did spend a few years creating the scheme, so I think we should just use their suggestion, unless there is something newer.

Social scientists use many datasets that should be cited. Sometimes they are surveys collected in a particular location between two dates (i.e. the source does not change), sometimes they are data from public sources like Eurostat, where new data is added at regular intervals (so the source does change).

tangofil avatar Apr 01 '21 18:04 tangofil

This has been implemented in CSL since version 1.0.1. It is currently available in Zotero using a workaround and will be available in Zotero as a regular item type in the future. No need for further explanations.

adam3smith avatar Apr 01 '21 18:04 adam3smith

Is there any info on when it will be included? This issue is 10 years old.

philippemiron avatar Aug 03 '22 15:08 philippemiron

Zotero never does ETAs, but they've been making changes to the data model, so probably not too far out (I'd guess months, but that's just a guess)

adam3smith avatar Aug 03 '22 15:08 adam3smith

@dstillman since you're working on standard already (and it'd make a lot of people in my line of work happy if we got this into Zotero) could I advocate for including this into the next type update:

Here are the proposed fields. Name: Dataset (there is some discussion here of "Data", "Data set", "Dataset" and several other contenders, but I think Dataset makes the most sense).

  • Standard fields: Title, Author/Contributor, Date, Language, Short Title, URL, DOI, Accessed, Archive, Loc. in Archive, Library Catalog, Call Number, Rights, Extra, Date Added, Modified
  • Identifier (CSL: number) -- while DOIs are very common, many repositories, especially in the life sciences, have their own ID schema
  • Repository and Repository Location (publisher and publisher-place respectively. The latter is getting rare, but we still see it in requested citations).
  • Version
  • Type (CSL: genre)
  • Medium (e.g., if older data are on CD-ROM etc.)

I've checked this against the latest iteration of the DataCite Metadata schema and it hits all relevant fields that could possibly be cited. I think it's worth keeping Archive in there for historical data that's not in a repository but in a physical archive.
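To make the proposed mapping concrete, here is a minimal sketch of how a record using these field labels could be turned into a CSL-JSON-style item of type "dataset". Only the pairs Identifier/number, Repository/publisher, Repository Location/publisher-place and Type/genre are stated in the proposal; the remaining pairs, the function name `to_csl_dataset` and the sample record are assumptions made for illustration and do not describe Zotero's actual implementation.

```python
import json

# Proposed Zotero field label -> CSL variable.
FIELD_TO_CSL = {
    # Stated in the proposal above:
    "Identifier": "number",
    "Repository": "publisher",
    "Repository Location": "publisher-place",
    "Type": "genre",
    # Assumptions (not spelled out in the proposal):
    "Title": "title",
    "Version": "version",
    "Medium": "medium",
    "DOI": "DOI",
    "URL": "URL",
}


def to_csl_dataset(fields: dict) -> dict:
    """Convert a dict keyed by the proposed field labels into a CSL-JSON-style item.

    Labels without a known mapping and empty values are dropped.
    """
    item = {"type": "dataset"}
    for label, value in fields.items():
        csl_name = FIELD_TO_CSL.get(label)
        if csl_name and value:
            item[csl_name] = value
    return item


# Hypothetical record, invented for illustration only.
record = {
    "Title": "Example Household Survey",
    "Identifier": "EX-0001",
    "Repository": "Example Data Archive",
    "Version": "2",
    "Type": "Survey data",
    "Notes": "labels without a mapping are dropped",
}

print(json.dumps(to_csl_dataset(record), indent=2))
```

Creators and dates are left out of the sketch because CSL-JSON represents them as structured values rather than plain strings.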

adam3smith avatar Aug 08 '22 14:08 adam3smith

Do we want to label it Medium or Format (which is what is currently used for Audio Recording, Film, Video Recording)? I don't have a strong preference for one over the other.

bwiernik avatar Aug 08 '22 15:08 bwiernik

It's Format in DataCite, so if we're using that elsewhere already, let's stick with that.

adam3smith avatar Aug 08 '22 15:08 adam3smith

Great, let's go with Format

bwiernik avatar Aug 08 '22 15:08 bwiernik

> Do we want to label it Medium or Format (which is what is currently used for Audio Recording, Film, Video Recording)? I don't have a strong preference for one over the other.

@bwiernik Audio and Film should be cited and referenced either as film or audio units or as collections, so Medium for dataset here makes sense to me. Frankly, datasets should not be audio or video content. There is a difference between a dataset and a collection. People need to be citing these groups of audio and video artifacts as collections.

HughP avatar Apr 20 '23 20:04 HughP

> Frankly datasets should not be audio or video content. There is a difference between a dataset and a collection. People need to be citing these groups of audio and video artifacts as collections.

The vast majority of people working on data and data infrastructure would disagree with that (there are data repositories specifically dedicated to video data), but that's also not the point here, so we don't need to solve it.

The reason bwiernik mentioned that 'Format' is used for video-type item types is that it denotes the format in which video content is delivered (e.g., on DVD, CD-ROM, Blu-ray), which is reasonably similar to the types of formats cited for data delivery (where still relevant), such as DVD or CD-ROM, so it makes sense to use the same variable. I honestly think either option would have been fine but, as I said, aligning this with the DataCite metadata terminology probably makes sense.

adam3smith avatar Apr 20 '23 20:04 adam3smith