More than one curator?

Open urinieto opened this issue 5 years ago • 13 comments

From this issue I realized that the current jams schema doesn't allow for a given annotation to be curated by more than one person. Is that true? If so, we should consider enhancing the schema to allow a list of curators.

urinieto, Nov 20 '18 00:11

could you provide an instance where one annotation is provided by more than one curator, instead of multiple annotations provided by one curator each?

ejhumphrey, Nov 20 '18 00:11

It's not that a single annotation is "provided by more than one curator" (the annotation is provided by an annotator, right?), but that a single annotation can be curated by one or more people. For example, the HEMAN dataset employs data that were curated (but not annotated) by me and that later @irisyupingren further reviewed, cleaned, published, and formatted (i.e., curated?).

Maybe I'm employing the word "curator" in the wrong way? To be honest, it's a bit confusing from the jams specification docs:

curator : a structured object containing contact information (name and email) for the curator of this data; annotator : a sandbox object to describe the individual annotator — which can be a person or a program — that generated this annotation;

urinieto, Nov 20 '18 01:11

@urinieto your take is correct (imo), annotator is whoever generated the specific annotation, while curator is the person (or people) who did the work of collecting all the annotations into a dataset. Under this reasoning, it makes perfect sense to have more than one curator.

justinsalamon, Nov 20 '18 01:11

Glad we agree on this, Justin. I just reviewed the original paper, and it also feels a bit confusing to me (who wrote this paper?):

curator (F) is itself an object with two subfields, name and email, for the contact person responsible for the annotation; and annotator (G) is another unconstrained object, which is intended to capture information about the source of the annotation

urinieto, Nov 20 '18 01:11

ah yes, you're right + I'm wrong – curator is the person(s) responsible for collecting the annotation, the annotator is the observer.

I guess one thing that we punted on pretty hard was having an "agent" datatype in the schema; we weren't really sure what kinds of curators would crop up (people? teams? universities?), and so it got left as a single string. In hindsight, this is kind of a great problem to have, since it's lightyears ahead of unstructured text files..

maybe I could rephrase my question better: would an array of open-ended strings be enough? or is there enough data to infer what a more structured Curator object might look like?

also I definitely wrote that section of the paper, so that makes me 0/2 on this thread.

ejhumphrey, Nov 20 '18 01:11

haha even if you wrote that section, we should've pointed out how ambiguous it was (i.e., don't be too hard on yourself, we The Authors are all to blame here 💃 ).

Ok, back to your question, I would go with either a single open-ended string (e.g., "Eric Humphrey [email protected] & Justin Salamon [email protected]"), or an array of Curator objects. Since the open-ended string might become a bit too complicated to parse, the latter option seems better to me.

Also, ideally, I would allow either one Curator or a list of Curators, but not sure how ugly this would look in terms of schema design/validation.

urinieto, Nov 20 '18 02:11

I see the Curator field as more of a point of contact, rather than the person who constructed the data per se. The way I'd want to use it is to have a way to chase down whoever's responsible for bugs or revising annotations going forward, not so much as a historical attribution field. (The latter should go in the annotator field.)

In that light, I'm not sure having multiple points of contact makes much sense, but I agree that the current Curator field is lacking in many ways.

Maybe it's worth considering the data management / revisioning proposal that @mcartwright wrote up for our OSS-MIR paper in IEEE-SPL? Thinking about how the Curator field could be expanded / replaced by something more useful for specific purposes, eg, where to send bug reports. In that case, maybe a URL is more appropriate than an email address? Or something else entirely?

bmcfee, Nov 20 '18 02:11

generally if a field is to be repeated, it should always be an array, and maybe specify a minimum number of elements. afaik mixing data types (allowing Curator and Array<Curator>) is poor form.
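As a rough illustration of the array-only option, here is a minimal sketch of what such a schema fragment could look like, written as a Python dict with a small hand-rolled check so it runs without extra dependencies. The field name `curators`, the choice of `name` as the only required key, and the helper `check_curators` are all assumptions made for the example; this is not the actual JAMS schema.

```python
# Hypothetical JSON Schema fragment (NOT the real JAMS schema): "curators" is
# always an array of Curator objects, with a minimum number of elements.
CURATOR_SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
    "required": ["name"],  # assumption: only the name is mandatory
}
CURATORS_SCHEMA = {"type": "array", "items": CURATOR_SCHEMA, "minItems": 1}


def check_curators(value):
    """Hand-rolled check mirroring CURATORS_SCHEMA; returns a list of error strings."""
    # No mixed types: a bare Curator object is rejected, only arrays are accepted.
    if not isinstance(value, list):
        return ["curators must be an array, even for a single curator"]
    errors = []
    if len(value) < CURATORS_SCHEMA["minItems"]:
        errors.append("curators needs at least %d element(s)" % CURATORS_SCHEMA["minItems"])
    for i, item in enumerate(value):
        if not isinstance(item, dict):
            errors.append("curators[%d] must be an object" % i)
            continue
        for key in CURATOR_SCHEMA["required"]:
            if key not in item:
                errors.append("curators[%d] is missing '%s'" % (i, key))
        for key in CURATOR_SCHEMA["properties"]:
            if key in item and not isinstance(item[key], str):
                errors.append("curators[%d].%s must be a string" % (i, key))
    return errors
```

A single curator is then written as a one-element list, so readers of the data only ever handle one shape. `minItems` is a standard JSON Schema keyword, so the same fragment could be handed to a JSON Schema validator instead of the hand-written check.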

to @bmcfee's point, I really like the idea of thinking about why it exists. If curator equals "who do I bother", then perhaps either URLs or email addresses are equally fine?

ejhumphrey, Nov 20 '18 02:11

My 2c as the person who put the "curator" field in there in the first place :)

The intention was precisely for attribution. While a dataset can have many annotators (especially in a crowdsourcing scenario), it usually has a small set of curators who are in charge of putting the whole thing together, quality control, etc. Basically, it's like an art exhibition: it may consist of artworks by multiple artists (annotators in this analogy), but it is usually curated by just one or two people, the curators.

Personally I think it's important to have such a field, because annotator(s) != curator(s) != point of contact.

The assumption was that the first curator is also the POC, and that people would infer that on their own. If you think it's worth adding an explicit "contact" field (e.g. with an email address) I'm totally fine with that, but not at the expense of the "curator" field, IMO.

justinsalamon, Nov 20 '18 02:11

p.s. forgot to add, in light of the above, I'd support @urinieto's proposal of making the curator field a list of Curator.

justinsalamon, Nov 20 '18 03:11

Coming back to this one, it seems to me that curator is a collection-level attribute, not an annotation-level attribute. As we start planning for #178 and more collection-oriented things, does it even make sense to keep a curator field in the annotation metadata objects?

I'm thinking it might be better to lift that up a level; annotations can belong to collections, and collections can have curators, as well as other properties: home page, DOI, etc. For my typical use cases, a DOI pointing to a zenodo page for the dataset would be perfect. From there, I can get all the attribution and contact info I need, and the maintainers can worry about keeping things up to date there. For example, if a curator changes email address, there's currently no mechanism to propagate that information back to a bunch of jams files out on the internet. Relying on zenodo (or figshare, or whatever it happens to be) for this seems like a much better approach.
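To make the "lift it up a level" idea concrete, here is a minimal sketch using plain dataclasses. All class and field names (`Curator`, `Collection`, `AnnotationMetadata`, `doi`, `homepage`) are hypothetical and chosen only for illustration; they are not part of the jams object model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Curator:
    name: str
    email: Optional[str] = None


@dataclass
class Collection:
    """Hypothetical collection-level record: curators and links live here."""
    name: str
    doi: Optional[str] = None        # e.g. a DOI resolving to the dataset's landing page
    homepage: Optional[str] = None
    curators: List[Curator] = field(default_factory=list)  # zero or more


@dataclass
class AnnotationMetadata:
    """Annotation-level metadata keeps the annotator, but no curator of its own."""
    annotator: dict = field(default_factory=dict)
    collection: Optional[Collection] = None

    def curators(self) -> List[Curator]:
        # Curators are looked up through the collection, not stored per annotation.
        return self.collection.curators if self.collection else []
```

With this layout, updating a curator's contact details means changing one collection record (or the page its DOI points to) rather than every annotation that belongs to it.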

bmcfee, Aug 12 '19 17:08

I agree with @bmcfee: curator should be a collection-level attribute, and there may be more than one curator associated with a collection.

My only concern is that changing this would potentially make pretty much all JAMS files to date incompatible with the new schema. Unless we do something smart about it, with deprecation warnings and so on.

urinieto, Aug 12 '19 18:08

> My only concern is that changing this would potentially make pretty much all JAMS files to date incompatible with the new schema.

Yup, that'll happen. The ideal fix here will be to 1) standardize the schema into a self-contained definition (ie without namespace runtime patching) as noted in #178, 2) put the schema under proper version control, and 3) put converters in place for migrating between versions.

If we set this up properly, then migration should be pretty easy, since we're going from an "exactly one of" to a "zero or more of" type of field, though obviously the python object model will have to change to stay usable.
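A converter for that "exactly one" to "zero or more" step could be sketched roughly as below. It works on a JAMS file already loaded as a plain dict (for example via `json.load`) and assumes each annotation carries its curator under `annotation_metadata`; the target field name `curators` and the function name are hypothetical, since no new schema has been agreed on.

```python
import copy


def migrate_curator_to_curators(jam):
    """Return a copy of a JAMS dict with each single curator wrapped in a list.

    Sketch only: the target field name "curators" is an assumption.
    """
    out = copy.deepcopy(jam)  # leave the caller's data untouched
    for ann in out.get("annotations", []):
        meta = ann.get("annotation_metadata", {})
        if "curator" not in meta:
            continue
        old = meta.pop("curator")
        # A curator with no name and no email carries no information,
        # so it becomes zero curators instead of one empty entry.
        is_empty = not old or not any(old.values())
        meta["curators"] = [] if is_empty else [old]
    return out
```

Going in the other direction (many curators back to one) would be lossy, which is one reason to keep explicit schema versions alongside converters like this.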

bmcfee, Aug 12 '19 18:08