dwc
New term - verbatimLabel
New term
- Submitter: Hannu Saarenmaa
- Efficacy Justification (why is this term necessary?): To provide a digital representation derived from and as close as possible in content to what is on the original label(s).
- Demand Justification (name at least two organizations that independently need this term): Survey of digitizing collections conducted by @tmcelrath, DataShot (MCZ), TaxonWorks
- Stability Justification (what concerns are there that this might affect existing implementations?): New term, does not adversely affect any existing terms or implementations.
- Implications for dwciri: namespace (does this change affect a dwciri term version)?: As a "verbatim" term, dwc:verbatimLabel is not expected to have a dwciri: analog, so there are no implications in that namespace.
Proposed attributes of the new term:
- Term name (in lowerCamelCase for properties, UpperCamelCase for classes): verbatimLabel
- Organized in Class (e.g., Occurrence, Event, Location, Taxon): MaterialSample
- Definition of the term (normative): The full text from all labels affixed on or near a MaterialSample, free from any and all interpretation, translation, or transliteration.
- Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text except to designate lines or breakpoints between blocks of text to establish context that could be verified by seeing the original labels or images of them.
- Examples (not normative):
- Refines (identifier of the broader term this term refines; normative): None
- Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None
- ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative): /Marks/Mark/MarkText
Original comment:
Was https://code.google.com/p/darwincore/issues/detail?id=124
Submitter: Hannu Saarenmaa
Justification: In the first phase of the digitisation process we try to capture everything "as is". Interpretation should follow from that.
Definition: The full, verbatim text from the specimen label.
Comment: There are various verbatim fields in Darwin Core already, but they do not capture everything.
Refines:
Has Domain: Separators for line and different labels are needed. They need to be something that cannot possibly be present in label texts, such as $ and §.
Has Range:
Replaces:
ABCD 2.06:
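The separator idea in the "Has Domain" note above can be sketched in Python. This is only an illustration under stated assumptions: the note names `$` and `§` but not their roles, so using `$` for line breaks and `§` for label boundaries is a guess, `encode_labels` and `decode_labels` are hypothetical names, and the label text is invented. The encoder rejects text that already contains a separator, because the scheme only works if neither character can occur on a label.

```python
LINE_SEP = "$"   # assumed role: line break within one label
LABEL_SEP = "§"  # assumed role: boundary between two labels


def encode_labels(labels):
    """Flatten a list of labels (each a list of lines) into one string."""
    for label in labels:
        for line in label:
            if LINE_SEP in line or LABEL_SEP in line:
                raise ValueError(f"separator found in label text: {line!r}")
    return LABEL_SEP.join(LINE_SEP.join(label) for label in labels)


def decode_labels(text):
    """Recover the list of labels (each a list of lines) from the string."""
    return [label.split(LINE_SEP) for label in text.split(LABEL_SEP)]


# Invented example: a collecting-event label and a determination label.
labels = [["USA: Illinois", "Champaign Co.", "12.vi.1954"], ["det. J. Smith 1960"]]
encoded = encode_labels(labels)
print(encoded)
print(decode_labels(encoded) == labels)
```

A label that legitimately contains `$` (a price tag, for instance) cannot be encoded this way, which is why the choice of separators matters.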
Oct 6, 2011 comment #1 wixner I second that. At GBIF we had created our own term for this and it would be lovely to reuse a dwc term instead: http://rs.gbif.org/extension/gbif/1.0/typesandspecimen.xml#verbatimLabel
Sep 23, 2013 comment #4 gtuco.btuco I would like to promote the adoption of this term. To do so, I will need a stronger proposal demonstrating the need to share this information - that is, that independent groups, organizations, projects have the same need and can reach a consensus proposal about how the term should be used.
Sep 23, 2013 comment #5 gkamp76 verbatimLabel information, capturing all labels as they appear with the specimen, is essential for preserving the original information before subsequent interpretation takes place. It is, in fact, one of the simpler tasks (aside from handwriting interpretation) for relatively untrained data entry workers to do. Later interpretations can then be compared against the verbatimLabel as political boundaries change or as collector information is shortened or changed in subsequent publications; any number of interpretations may ultimately need to refer back to the original source: the verbatimLabel.
The question I would propose is: if you are talking about all labels, what do you really mean? Would this include specimen identifier labels? Determination labels? Type labels? Loan labels? The latter are often removed when a loan is returned. What constitutes the "original" verbatimLabel information? At the time of recording, having all of this information in one place (and if photographed, all are easily included) could be helpful as future workers realize, for example, that the attribution of one person as a determiner was incorrect given the date and taxon in question, and it was actually someone else with similar initials and family name.
Sep 23, 2013 comment #6 gtuco.btuco It might be a good idea to circulate the proposal on tdwg-content and see if a community can be built around and support the addition of this concept.
This proposal still needs evidence of demand.
My question is, "Is it not sufficient/preferable to capture the label images? That is one level less of interpretation already."
We use this field in TaxonWorks. We split it into three fields: "Buffered Determination Label", "Buffered Collecting Event Label" and "Buffered Other Labels". Just having an image is not enough, and sometimes we do not have an image.
Basically, I, and many other collections using TaxonWorks, want this DWC field.
Does this encompass both "gold standard" verbatim transcriptions of specimen labels and outputs of automated OCR processes (e.g. Tesseract)? How to encode the different approaches and their metadata (methodology)?
How to differentiate between labels and their relative location? I don't think $ and § are reliable enough, in particular if OCR outputs are in scope.
We use a field for verbatim transcription of a label in DataShot, our object-to-image-to-data workflow software. This captures the verbatim transcription of text from a region of interest representing a single label identified in an image of a set of labels. Subsequent workflow steps add interpretation of this verbatim text into structured data. In a less formal manner, there is a Twitter feed https://twitter.com/EntoTranslator and a Facebook group https://www.facebook.com/groups/232785306782255/ where images of difficult-to-interpret labels are posted for members of the community to provide either transcriptions of difficult-to-read handwriting or interpretations of words, phrases, abbreviations, and such on the labels. There are clear upstream needs in digitization workflows for representing verbatim label text in structured form.
Closing for lack of demand.
"Lack of demand?" Four different people have requested this be a DWC field and expected something to happen. I don't see lack of demand here. What do we need to provide to evidence "demand?"
TDWG members discussing a good idea does not constitute demand. The demand requirement needs independent organizations with a mission-driven need to share these data.
@tucotuco What specifically do you want us to provide, then? Would a survey of different natural history collections members with documented support of their need for this field suffice?
@tmcelrath TaxonWorks suffices to represent that class of proponent. That is the equivalent of one proponent. What other organization or project needs it? If you can come up with that, the next step is to submit a templated New term request. I can do that, adding it to the beginning of the first comment to keep all the discussion in one place, but I need that evidence of demand.
As noted above, we've got a field for this in the DataShot system at the MCZ, associated with a region of interest in an image that contains multiple labels, but haven't been able to go very far with this in the absence of a means of sharing with the community.
This initially seems like a straightforward enough proposal, but how does it interplay with the existing (and numerous) verbatim fields within DarwinCore? It seems to risk becoming a dumping ground for data that could/should go into existing fields, and perhaps discouraging their use because it's easier to just put it all, unstructured, into verbatimLabel.
I think my main reservation is the following: are there many examples where the existing verbatim fields are inadequate, and could these be better covered by additional verbatim field(s) rather than such a loosely defined single field?
@edwbaker The issue is actually slightly different. "Parsing" text into many verbatim fields automatically introduces interpretation by its very nature. For example: What is a "verbatimLocality"? Should all locality info go in it? Or just the most specific locality? We've had differences of opinion just within our own group on just this one field.
To answer your question, DWC absolutely does not have enough verbatim fields. There are no verbatim identification fields, or verbatim curation labels fields (e.g. accession numbers, comments about preparation, etc ...). We use the ones that DWC has in addition to the verbatim one we are providing. Users do not have to use these fields, and yes, it introduces duplication of text, but that actually adds more power in terms of text-breakdown. We will never stop misreading labels and having poor quality control, but having this field allows for comparisons to the original verbatim label and will allow for corrections to be made.
The idea of this field is in part, quality control. I have found having this field INVALUABLE more times than I can count when looking back at the original text, comparing incorrect GPS coords, poorly interpreted localities, or people misreading labels.
To anyone following this thread, I have a poll out right now: https://forms.gle/fgxbQUmQLQC4a1NY6 collecting people's thoughts about this proposed DWC field. Please help me gather responses there. I am looking to get as many diverse stakeholders as possible.
Reopened to accommodate renewed vigor in the proposal.
What I'm wondering about this proposal is if we are conflating data management with implementing a standard. In my work for OBIS-USA I rarely receive data already in Darwin Core and I have to do a crosswalk. When I do that work there is always a chance that I performed that work incorrectly in some way and so I do my best to preserve the original data in a data repository and a link to that in the IPT so that future users of the data can get back to the original data to check the translation if they need to. For me it would not make sense to have all of that information stored in verbatim fields. When and how is the best place to separate out the standardization of the data from management of the data? Apologies if my comment doesn't make sense in this context since this is primarily considering museum collection data and I'm thinking of sampling event data.
@albenson-usgs I think the only way of going back to the original data here is to include a label image. Having a label field is one potential source of error, then any further processing from that is another potential source of error.
There are a number of potential solutions to "the verbatim problem" in this thread (using either SKOS or a separate dwc namespace).
So far in the poll, all respondents want to see this term implemented in some form:
Respondents are from a variety of different Collection Management Systems/databases:
About half of respondents already use this field in their CMS:
There are various use cases for verbatim data. We described quite a few of them in a paper we wrote a while ago, more specifically in this table.
Darwin Core terms currently hardly support these use cases, with many verbatim concepts unaccounted for and no unambiguous term for the uninterpreted text dump as Tommy described.
While the content of this term will be messy and not very practical for machine training purposes, which seems like it could be a nice use case, it would support improved findability, validation efforts and linguistic aspects.
The issue I see with adding verbatimLabel or an equivalent (in name it doesn't cover other data sources, such as occurrences from a notebook) is that if we have that, why do we need all the verbatim fields in dwc? The current process seems to be we put the label data in verbatimX and cleaner data in X. If we follow this precedent, then we should look at what verbatim label data is missed at present, and how we address that (two possible solutions in my above comment). If we don't follow this precedent then (in my mind) we have a much larger discussion.
I think the point raised above by @albenson-usgs between data management (which I take in this instance to broadly be within an institution) and data standards (broadly between institutions) is highly relevant. From what I can see (glancing over dwc) this would be the first break from relatively atomic data to a definition that might include multiple data types. This alone I think is worthy of some serious discussion.
I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against.
Having had a more thorough search it looks like GBIF have already minted a verbatimLabel term, and that it is used in the DwC-A format already by Plazi - http://plazi.org/api-tools/api/.
Given that GBIF has minted a term, are there any stability issues with Darwin Core making one? Does the term have a definition? If so, is it semantically the same as proposed here?
I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against.
...and, having supplied a near-perfect clone of the item as an image, could we then eliminate all the existing verbatimX terms? If you want verbatimX, look at the image!
A significant part of this discussion stems from a lack of precision on all verbatimX terms & @albenson-usgs has identified this very well. DwC is an exchange standard & we're trying to shoehorn our very real need to track the provenance of data and the decisions/interpretations made. If anything, it would help the users of our data to better appreciate what goes into crafting assertions. If we take verbatimLocality as an example, it is defined as:
The original textual description of the place.
It is not qualified with, "...as written on physical media in close proximity to the physical object in question, free from any and all interpretation, translation, or transliteration." Evidently, it is not meant to represent a near-perfect snapshot. It offers no particular guidance when the original textual description of the place is physically, conceptually, or temporally removed from the physical specimen itself. What if that description of place is in Inuktitut in a field notebook held in an entirely different institution, written 2 years before the specimen was collected?
If we proceed with this, I would really like to see far more precision on the definition of verbatimLabel so it is abundantly clear what is its expected content & there are no downstream misunderstandings of how it could/should be used.
Definition: The full text from all labels affixed on or near a specimen, free from any and all interpretation, translation, or transliteration. There are no embellishments, prefixes, headers or other additions made to the text. However, lines or breakpoints between blocks of text are faithfully represented to establish context.
That said, is OCR-generated text from a label considered verbatim? What do you do about all the machine-generated artifacts, which are evidently "embellishments" of a sort? Is curation by a human an implicit requirement for verbatimLabel?
...and, having supplied a near-perfect clone of the item as an image, could we then eliminate all the existing verbatimX terms? If you want verbatimX, look at the image!
Ha! This takes me back to a conversation many years ago, when @stanblum was working at Bishop Museum, and he and I had this exact conversation about verbatimX fields. We both agreed that the true "verbatim" data would be an image of the label and/or catalogue ledger. This wasn't snark -- several of our collection managers have used handwriting to help sleuth out various data mysteries (the handwriting points to who wrote it, which points to other sources associated with that person, etc.)
At the time, digital imaging technology was such that it was a pipe-dream on a Museum budget. But now it's becoming the norm.
So... yes... +1 for shifting to images of labels & ledgers in place of ASCII/UTF-8 encoded interpretations of "original data".
The GBIF verbatimLabel term is a part of the Types and Specimen extension, which extends core Taxon data to support multiple type names or type specimens. This extension does not (currently) do the same for Occurrence data.
Raw images, even of segmented labels, are not a perfect substitute for verbatim (annotated) strings. Images are not machine readable and may not be as human readable as textual strings or strings annotated into different verbatim terms with a more specific meaning. Handwritten text may be poorly legible and label text may be ambiguous in its meaning. Partially transcribed text may allow different people looking at the image to build on each other's work.
The problem here is, as I suggested earlier, that verbatim data have many different use cases. For some use cases, looking at the image is sufficient or even optimal. For others, it is not feasible at all. For some, different verbatimX terms are desirable. For others, they are not useful at all. We're not going to address everyone's concerns with a single new term or namespace.
More specific standards exist for the exchange of verbatim text captured from images, such as Alto. But these have their own drawbacks and there are some complications when mixing handwritten text, typed text and marked up text such as logos, stamps, tables...
Results from the survey are in and viewable here: https://docs.google.com/spreadsheets/d/1eIiAgM_nJ_XpGbUCQ8f-ftO4ocIHFfklwHpa79OVdWk/edit#gid=192695490
In short, 97.7% of respondents want this field in some form (3 respondents wanted images included too, but supported the field in principle). This represents 18 different Collection Management systems, 33 different institutions, in multiple countries.
Respondents were evenly split on doing this as 1 vs. 3 fields, but considering some of the comments above, I think 1 field that minimizes interpretation is best; a CMS can easily merge its fields into one.
This field is useful for quality control, transcription workflows, artificial intelligence learning of labels (see paper linked by @matdillen above) and more. Nearly the entire community wants it, most are already using it (60% of survey respondents already support it in their systems), and there is active discussion and consensus that it is useful.
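As a rough illustration of that merge, here is a minimal Python sketch. It assumes a CMS holds the three buffered label texts as plain strings; `merge_buffered_labels` is a hypothetical name, and both the block order and the blank-line separator are assumptions rather than anything agreed in this thread. The label text is invented.

```python
def merge_buffered_labels(collecting_event="", determination="", other=""):
    """Join the non-empty buffered label texts into one verbatimLabel string.

    Blocks are separated by a blank line; the text inside each block is
    passed through unchanged so no interpretation is introduced.
    """
    blocks = [collecting_event, determination, other]
    return "\n\n".join(block for block in blocks if block.strip())


# Invented example: no "other" labels on this specimen.
merged = merge_buffered_labels(
    collecting_event="USA: Illinois\nChampaign Co.\n12.vi.1954",
    determination="det. J. Smith 1960",
)
print(merged)
```

Going the other way (one field back to three) would require deciding which block is which, and that is exactly the interpretation step a single field avoids.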
All comments so far against can basically be summarized as:
- Use images instead (great, but not every label will be imaged, and OCR still needs an output field; additionally, even if you move from text to images, you may still need an original text field to store "all data").
- Why do this when we could just make "verbatimX" fields? Separate issue in most respects. Do we probably need that? Yes. Is that the issue in question? No.
- The term needs clarification. Absolutely. Let's do that in discussion. Personally, I really like @dshorthouse's definition.
See additional comments in the spreadsheet, or I'm happy to add them here.
@tucotuco I'd be happy to lead/co-lead the next step in getting this DWC term adopted. Point me to what I need to do, and I'll do it.
I worry that there are domain-specific issues/practices that come into play here in how verbatimLabel might be populated. Take for example this image of a botanical specimen. Do you proceed top-bottom left-right bottom-top (making a 'U' in your traversal) to fill the single field? Could that seemingly innocuous decision to collapse the semantic dimensionality/positioning of labels result in downstream misinterpretation when the image is no longer present for examination? Note also that there are explicit "fields" expressed on some of these labels like "Locality:" or "Date:" and that these may be presented in single or multiple columns or may not have any content at all. For example, there is one label here with two "Date" prefixes but only one of them has human-supplied content - do you still supply both as part of the verbatimLabel? Absence of data in a particular "field" on a label might itself be meaningful & so inclusion of "Date" without a value could be important. One candidate definition above in a bid to seek precision states that we ought to remove embellishments or prefixes, are these considered prefixes? That blank "Date" is clearly associated with the Det. whereas the other one is the collection event. But, these are only knowable by their positioning in columns. Columns of data on a single label would also need to be collapsed in verbatimLabel too, right? How? This gets messy in a real hurry...
And so, whatever the definition of verbatimLabel, it has to speak to how (or how not) to type stuff in it when faced with considerable variability in source.
@dshorthouse I think there will be considerable variability in how data is put into this field, and that's okay. Obviously there are always going to be cultural differences in how labels are read/transcribed, but how is that different from any other DWC supported field? And depending on the discipline, they may or may not choose to use this field. Botanists, from what I understand, seem to like images of specimens much more than transcribed text, precisely for the reasons you give above. They are allowed much more space, have more information, and therefore it's easier to take a photo of their specimen labels (for example, I don't even know how you'd transcribe the formatting of the labels on the botany specimen above, which is rarely used on entomology labels, in order to conserve space). However, entomologists, who put less information on a label in general, find it much easier to just quickly transcribe a label exactly as it appears on the specimen, in the order it appears on the specimen, as close to how it appears on the specimen as possible, without any interpretation, because that is the easiest, quickest way to make sure all the information gets into a field that can later be parsed out.
So, all I'm suggesting is that we have the option to export that field for a variety of reasons. Will there be differences in how it's formatted from museum to museum? Probably. And I think we can mitigate exactly what you are describing by being explicit about best practices. For example, introducing no new formatting into a label except when needed to maintain meaning; or using [marks] to denote any interpretation if, for example, handwriting is uncertain (that's an example).
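To make the [marks] idea concrete, here is a small Python sketch. It assumes a convention in which anything inside square brackets was added by the transcriber; that convention and the name `transcriber_marks` are hypothetical, and the label text is invented.

```python
import re

# Assumed convention: text inside [...] is transcriber annotation, not label text.
MARK = re.compile(r"\[([^\[\]]*)\]")


def transcriber_marks(verbatim_label):
    """Return the bracketed annotations found in a transcription, in order."""
    return MARK.findall(verbatim_label)


# Invented example: one unreadable word and one uncertain surname.
text = "Champaign Co. [illegible] 12.vi.1954 leg. J. Sm[ith?]"
print(transcriber_marks(text))
```

A label that itself carries printed square brackets would be ambiguous under this convention, so a best-practice document would need to say how to handle that case.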
In all honesty, this field will NEVER be as "standardizable" as something like coordinates, agents, or dates can be (obviously some of those examples are under expansion/discussion). But how great would it be if you always had something to compare all those other parsed fields to? For example, the "agent strings" that get put into "determiner" or "collector" are sometimes only partially complete. Having the verbatimLabel to refer back to can help with that. Excel issue with auto-formatting dates when exported? Check the verbatimLabel. Bad GPS formatting? Check the verbatimLabel.
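A minimal Python sketch of that kind of check follows, assuming the parsed values were copied from the label rather than reformatted. `fields_missing_from_label` is a hypothetical helper, not part of any existing tool, and the record is invented (with a deliberately mistyped year).

```python
def normalize(text):
    """Lower-case and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def fields_missing_from_label(verbatim_label, parsed_fields):
    """Return names of parsed fields whose value does not occur in the label text."""
    haystack = normalize(verbatim_label)
    return [
        name
        for name, value in parsed_fields.items()
        if value and normalize(value) not in haystack
    ]


# Invented record: the date was mistyped as 1945 during parsing.
label = "USA: Illinois\nChampaign Co.\n12.vi.1954\nleg. J. Smith"
parsed = {
    "recordedBy": "J. Smith",
    "verbatimEventDate": "12.vi.1945",
    "county": "Champaign Co.",
}
print(fields_missing_from_label(label, parsed))
```

This only flags values that should appear literally on the label. Interpreted values, such as an ISO-formatted eventDate or decimal coordinates, would need their own comparison logic.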
So, I think all of this discussion is great. So many people have brought up good points. Let's move to the next step because the demand is here, there is community interest, and we should push to the next step.