dwc umbrella issue related to dwc:basisOfRecord and an Evidence class

At TDWG 2020, the subject of the need for an Evidence class came up several times.

Problems with the term dwc:basisOfRecord were also discussed. It is essentially a bespoke term to indicate the type of a record, but there is ambiguity about what particular resource the "record" is referring to since a single line in a table often contains information about several related resources.

Because of this ambiguity about table columns, this issue is therefore also related to the need for some better mechanism for describing the meaning of columns in CSV files that are part of a Darwin Core Archive. Data Packages and the W3C csv2rdf standard were mentioned in this context.

Given that this is a complex issue involving several interrelated issues, it probably needs a task group to come up with a solution. For now, I'm creating this umbrella issue as a way to document the discussion about these topics.

Ping @timrobertson100 @qgroom @dshorthouse @deepreef

Oct 28 '20 18:10 baskaufs

Related issue "basisOfRecord for Plazi datasets" https://discourse.gbif.org/t/basisofrecord-for-plazi-datasets/2238/3 @dagendresen @ekrimmel @agosti

Oct 28 '20 18:10 baskaufs

From Richard Pyle,

Thanks for looping me in, Steve. I agree with all your points below. These kinds of problems are popping up in a number of different places in the TDWG domain (e.g., literature discussions, Agent discussions, etc.).

I think part of the problem is the way in which people think of the meaning of an “occurrence”. I’ve discussed this with several of you over the years, but from my perspective, there is a general and fundamental misapplication of dwc:Occurrence among many in our community, which I think is largely a quirk of history. DwC started out as a standard for exchanging data about specimens. The greatest perceived value of being able to aggregate specimen data was to get better/more robust answers to questions about “what” (Taxon), “where” (Location) and when (the latter of secondary priority to the former two). Of course, observational data also provides valuable insights into what/where/when, so “Specimen” became “Occurrence” to allow integration, and a need arose to indicate whether a particular instance was based on a vouchered specimen or some sort of in-situ observation. Somewhere in that mix is where I think a lot of people (especially those who started out with DwC in the era when it was about specimens) started equating the “Occurrence” as the “Specimen”.

But the “specimen” is not the “occurrence”! The moment when the specimen (physical) was extracted from nature is the occurrence (abstract). We now have dwc:Organism and dwc:MaterialSample, the former representing the basis of a dwc:Occurrence instance (which I define as an intersection of dwc:Organism and dwc:Event instances); and the latter introduced to accommodate multi-taxon samples, but simultaneously creating a class that actually does represent a “specimen”.

I think this history and confusion has led to the issues we’re having with dwc:basisOfRecord. What started out as a need to distinguish between records based on vouchered specimens, vs. records based on non-vouchered occurrence records, has become an overloaded term that tries to solve several problems (but doesn’t really succeed at any of them). Here is what the DwC quick reference guide says are Examples of basisOfRecord, along with some editorializing from me:

PreservedSpecimen Easy enough – this is how DwC started – to allow exchanging and aggregating data about Museum specimens.

FossilSpecimen OK, I guess this is different from PreservedSpecimen both because it’s not really “Preserved” (or was already preserved prior to extraction from nature?), and because the implications of its underlying Occurrence are somewhat different, in that there are other properties (like those associated with dwc:GeologicalContext). And perhaps also to imply certain assumptions about the organism on which the occurrence is based (i.e., it died long before its occurrence was documented, and may have moved).

LivingSpecimen Needed for things that aren’t dead yet (and, hence, not yet preserved), such as organisms in zoos and aquaria and botanical gardens and cultures and such. Is this because the associated dwc:Occurrence is anchored to existing locations (i.e., the aforementioned zoos and aquaria and gardens), rather than the intersection of the Organism with the Event at which it was extracted from nature (as is the case for dead specimens)? What are the implications of tagging records with this value, compared to the other two values above? What assumptions or restrictions on fit-for-purpose evaluations come with this designation?

MaterialSample A genuine DwC class, but by the DwC definition, this appears to be a superclass of the first two above. I’m not sure how many people regard an instance of LivingSpecimen to be a subclass of MaterialSample (I probably would).

HumanObservation This is the original “Observation” value, to distinguish an instance of Occurrence from a vouchered specimen. Easy enough.

MachineObservation Same as HumanObservation, except the Human interpreting the observation didn’t participate directly in the Event itself.

Event A genuine DwC Class, not conflated with the others.

Taxon Another genuine DwC Class, not conflated with the others.

Occurrence The superclass for all the others except Event & Taxon? Or maybe just the superclass for the two Observations?

I know this has already been discussed to death, including the overloaded nature of dwc:basisOfRecord and the suggestion of a need to recognize classes and subclasses of these things, and so on. I also know that these are only represented as examples, not a controlled vocabulary or explicit enumeration of allowable values. But I think we can break down these various values into two separate domains.

The first domain is bona-fide DwC Classes (Occurrence, Taxon, Event, MaterialSample). I see these as logical values for a dwc:basisOfRecord term/property for use in things like star schema, so it’s clear what class of object the associated values apply to. Missing are the other DwC Classes (Organism, Location, GeologicalContext, Identification, MeasurementOrFact, ResourceRelationship, [UseWithIRI?]). We probably don’t even need this as a record-level term (BTW, why is this nested with dwc:Occurrence? Shouldn’t it be

The second domain consists of qualifiers (subclasses) of either MaterialSample or Occurrence (or possibly Organism?):

MaterialSample

          PreservedSpecimen

          FossilSpecimen

          LivingSpecimen(?)

Occurrence

          HumanObservation

          MachineObservation

          LivingSpecimen(?)

Organism?

          LivingSpecimen(?)

I think the pathway to salvation is to refine the definition of basisOfRecord to be restricted to the names of DwC Classes, so that there is an explicit indicator for each record as to what class of object that record represents. The other things are more aligned with what we’ve been talking about as an Evidence Class. Mostly Evidence underpins Occurrence instances, but it can also underpin Identification instances. The Evidence itself can take various forms, including instances of MaterialSample, Images, “MaterialCitation”, and unvouchered in-situ observations of various kinds (some involving humans, some involving machines, some involving images created by either humans or machines).
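One way to picture the split proposed above is a lookup that separates a legacy basisOfRecord value into a DwC class name and an evidence-style qualifier. This is only a sketch under stated assumptions: the function name `split_basis_of_record` and the returned structure are hypothetical, and the parent assignments simply follow the outline earlier in this message, with LivingSpecimen left ambiguous as it is there.

```python
# Hypothetical illustration only; none of this is part of Darwin Core.
# DwC class names mentioned in this thread as candidate basisOfRecord values.
DWC_CLASSES = {
    "Occurrence", "Taxon", "Event", "MaterialSample", "Organism",
    "Location", "GeologicalContext", "Identification",
    "MeasurementOrFact", "ResourceRelationship",
}

# Qualifier values and their candidate parent classes, copied from the
# outline above. LivingSpecimen keeps all three question-marked parents.
QUALIFIER_PARENTS = {
    "PreservedSpecimen": ["MaterialSample"],
    "FossilSpecimen": ["MaterialSample"],
    "HumanObservation": ["Occurrence"],
    "MachineObservation": ["Occurrence"],
    "LivingSpecimen": ["MaterialSample", "Occurrence", "Organism"],
}


def split_basis_of_record(value):
    """Return (candidate_classes, qualifier) for a legacy basisOfRecord value."""
    if value in DWC_CLASSES:
        # Already a class name: nothing to move into an evidence qualifier.
        return [value], None
    if value in QUALIFIER_PARENTS:
        return list(QUALIFIER_PARENTS[value]), value
    raise ValueError(f"unrecognized basisOfRecord value: {value!r}")
```

Under this sketch a value such as "FossilSpecimen" yields the class MaterialSample plus a qualifier, while "Event" passes through unchanged; how the qualifier would actually be carried (for example by an Evidence class) is exactly what remains open.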

I’ve rambled on enough for now, but I completely agree with the following points:

dwc:basisOfRecord needs some clarification in function & purpose
It’s probably time to seriously consider dwc:Evidence
It’s probably time to revisit the star schema to either modify or replace it
We probably need a new Task Group to sort this out (within which Interest Group, though?)

Vaguely related to this, we also need to start seriously considering two more dwc classes (Reference and Agent). We’ve avoided these in the past mostly (I think) because we felt they were outside of our Scope. But DwC already has a Location class, and the TDWG community has embraced AudubonCore, and we have an Annotation group. All of these areas are covered in much more general contexts outside of TDWG-land, but we created them because we have domain-specific needs associated with them. I think the same is true for Agents and References, and they are somewhat related because they both have a lot of relevancy to a possible Evidence class.

Oct 29 '20 07:10 qgroom

Thanks, @qgroom !

Oct 29 '20 07:10 deepreef

Also relevant... https://discourse.gbif.org/t/understanding-basis-of-record-a-living-specimen-becomes-a-preserved-specimen-gbif-data-blog/1349

Oct 29 '20 07:10 qgroom

There is also an open issue in GBIF-marine on basisOfRecord https://github.com/iobis/gbif-marine/issues/10

Oct 29 '20 07:10 qgroom

Note that the vocabulary GBIF uses for basis of record is not the same as that suggested for Darwin Core https://gbif.github.io/parsers/apidocs/org/gbif/api/vocabulary/BasisOfRecord.html

Oct 29 '20 07:10 qgroom

https://discourse.gbif.org/t/basisofrecord-for-plazi-datasets/2238

Oct 29 '20 08:10 qgroom

Meant to say "basisOfRecord" but "establishmentMeans" came out of my fingers!

Oct 29 '20 10:10 baskaufs

See also https://github.com/tdwg/dwc-qa/search?q=basisOfRecord&type=issues


Oct 29 '20 13:10 tucotuco

I don't have time at the moment to fully engage in this discussion, but I wanted to make one note about the idea of an Evidence class. In our discussions about minting the "Token" class (which should have been called Evidence because that's exactly how it's used) in Darwin-SW, @camwebb and I considered whether such a class was actually necessary or not. The critical thing is actually to have the term "hasEvidence" to link to the evidence. Whether or not we declare that linked thing as an instance of an Evidence class is secondary to the linking. We can infer that it is evidence by use of the linking property. In the case of Darwin-SW, dsw:hasEvidence has a range of dsw:Token, so the act of using the property entails that the connected resource is a dsw:Token automatically.

The TDWG Vocabulary Maintenance Spec disallows such range declarations as part of the core metadata for terms, so we probably wouldn't do that in a dwc:hasEvidence term. But using the term would still imply that the object of the statement is "evidence" regardless of whether we mint a dwc:Evidence class or not. Most if not all objects would already be instances of some other class like dwc:PreservedSpecimen, dcmitype:StillImage, foaf:Document, etc. and there wouldn't be any harm in also declaring them to be an instance of a second class (dwc:Evidence). But it isn't clear to me that anything would necessarily be gained by that.
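The range entailment described above can be illustrated with a few lines of Python. This is a minimal sketch using plain tuples instead of an RDF library; the `ex:` identifiers and the helper name `infer_range_types` are made up for the example, and it handles only one declared range per property.

```python
# Minimal sketch of rdfs:range entailment over (subject, predicate, object)
# tuples. Not a full RDFS reasoner; one range per property is assumed.
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def infer_range_types(triples):
    """Return the input triples plus (object, rdf:type, Class) for every
    use of a property that has a declared range."""
    ranges = {s: o for s, p, o in triples if p == RDFS_RANGE}
    inferred = set(triples)
    for s, p, o in triples:
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred


triples = {
    # Range declaration as described for Darwin-SW in the comment above.
    ("dsw:hasEvidence", RDFS_RANGE, "dsw:Token"),
    # Hypothetical data: an occurrence linked to a specimen.
    ("ex:occurrence1", "dsw:hasEvidence", "ex:specimen1"),
    ("ex:specimen1", RDF_TYPE, "dwc:PreservedSpecimen"),
}

closure = infer_range_types(triples)
```

After inference the specimen carries both its intrinsic type and dsw:Token. Without the range declaration, as would be the case for a dwc:hasEvidence term, the second type would have to be asserted explicitly or left out.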

Oct 29 '20 14:10 baskaufs

I have a subtly different view of this. I don't see the instances of a class Evidence being the actual items you list (dwc:PreservedSpecimen, dcmitype:StillImage, foaf:Document, etc.). Rather I see instances of Evidence as actually being the join between a dwc:Occurrence (or dwc:Identification) and the instance of one of those other classes you mention (among others). In this way, I see a proposed dwc:Evidence class as being analogous to dwc:Identification, the latter of which effectively serves as a join between an instance of dwc:Organism and an instance of dwc:Taxon. I didn't explain that well in the stream-of-consciousness email that @qgroom posted on my behalf above.

Oct 30 '20 01:10 deepreef

Sounds like Evidence may be too generic for DwC. If some evidence is used for an Identification and some is used to assert an Occurrence, they'll be evidence(s) of very different natures. But maybe we're getting into the territory of Assertions and their Evidence as more generic classes.


Oct 30 '20 01:10 tucotuco

@tucotuco -- I'm not sure I follow. The context in which I suggested "or dwc:Identification" was based on a realization we had that exactly the same things can serve as evidence for both occurrences and taxonomic identifications. I think the primary emphasis should be focused on "Evidence of Occurrence" (I believe that's the context that Darwin-SW framed Token/Evidence). For example: "This specimen represents Evidence that this Organism occurred at this Event" (i.e., this Evidence supports this Occurrence). You can replace "Specimen" with "Image" or "Publication" ["materialCitation"?] or "Human Observation" or "Machine Observation" (etc.). What occurred to us when implementing this model is that some of these (especially "Image" and "Specimen", but also things like "DNA sequence" and potentially others) not only serve as "Evidence of Occurrence", but also (potentially) "Evidence of Identification". In other words, "This specimen/Image/DNA Sequence represents Evidence that this Organism is identified as this Taxon." And each piece of "evidence" can simultaneously represent both (i.e., both Evidence of Occurrence and Evidence of taxonomic Identification).

It might make more sense to focus only on "Evidence of Occurrence", but I can see a possible role for "Evidence of Identification" as well.

Incidentally, another way to solve this is, instead of creating a dwc:Evidence class, we could just adopt a standard value of dwc:relationshipOfResource (e.g., "hasEvidence"/"isEvidenceOf") and capture all this within instances of dwc:ResourceRelationship. But I could make essentially the same argument for dwc:Identification (with some tweaks). Elsewhere there are murmurings of tweaking dwc:ResourceRelationship to accommodate a broader array of functions (e.g., adding a sequence term to the class to facilitate linking Agents to various other dwc classes).
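As a rough illustration of the ResourceRelationship alternative, the sketch below builds evidence links as plain records. The field names are existing Darwin Core ResourceRelationship terms; the identifiers, the "isEvidenceOf" value, and the helper functions are hypothetical and only stand in for whatever value might eventually be standardized.

```python
# Sketch: evidence expressed as dwc:ResourceRelationship-style records.
# Field names follow Darwin Core; values and helpers are invented.
def evidence_relationship(rel_id, evidence_id, occurrence_id):
    return {
        "resourceRelationshipID": rel_id,
        "resourceID": evidence_id,           # subject: specimen, image, etc.
        "relationshipOfResource": "isEvidenceOf",
        "relatedResourceID": occurrence_id,  # object: the supported occurrence
    }


def evidence_for(occurrence_id, relationships):
    """List the resources recorded as evidence of one occurrence."""
    return [
        r["resourceID"]
        for r in relationships
        if r["relationshipOfResource"] == "isEvidenceOf"
        and r["relatedResourceID"] == occurrence_id
    ]


rows = [
    evidence_relationship("rel-1", "specimen-42", "occ-7"),
    evidence_relationship("rel-2", "image-9", "occ-7"),
    evidence_relationship("rel-3", "image-9", "occ-8"),
]
```

Here `evidence_for("occ-7", rows)` picks out the specimen and the image, and the same image also supports a second occurrence, so the many-to-many case needs no additional structure.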

Indeed... I look forward to the day when ALL relationships between instances from among and within DwC classes are represented this way (i.e., as instances of dwc:ResourceRelationship). But that's probably a topic for another Issue (or Task Group, or Interest Group...)

Oct 30 '20 02:10 deepreef

OK, just to try to clarify how I am understanding the situation, here are two diagrams that I think represent the difference between how Rich is describing evidence and how Darwin-SW describes evidence. (DSW does not use the name "evidence", rather it uses "token" but you can consider the two to be interchangeable in the discussion below.)

[Diagram: Rich's evidence model]

If I'm understanding correctly, Rich is saying evidence is a "join" (which I would call a node) between occurrence and the thing that is serving as evidence. The type of the evidence node is dwc:Evidence and the type of the thing that is serving as the evidence is dwc:PreservedSpecimen, but could be any of a number of other things like dcmitype:StillImage, foaf:Document, etc. I suppose a dwc:hasEvidence property would link the occurrence to the evidence instance. I'm not sure what property would link the evidence instance to the resource that is serving as the evidence.

[Diagram: DSW evidence model]

The way Darwin-SW models evidence is that any type of thing can be the evidence and the dwc:hasEvidence property would connect the occurrence to the resource that is serving as the evidence. The resource serving as the evidence is going to intrinsically have some type (dwc:PreservedSpecimen, dcmitype:StillImage, foaf:Document, etc.) but could also be an instance of the class dwc:Evidence. I am showing the typing as dwc:Evidence in red because it is not clear to me if that statement actually serves any purpose and perhaps is superfluous since the resources serving as evidence already have other more meaningful types. We know that the resource serving as evidence is evidence because we've used the dwc:hasEvidence property to connect it.

There is nothing "wrong" with either of these models. But in my experience, it is best to have the simplest model that allows you to do what you need to do, and no more. There are two reasons why we might need an extra node in the middle like Rich has suggested:

If there are properties that we want to attach to that node that we can't reasonably attach to one of the adjacent nodes (the occurrence or the resource serving as the evidence).
If the node is needed to facilitate one-to-many or many-to-many relationships. For example, if a single "evidence" instance of Rich's type needed to be connected to many resources serving as evidence (as opposed to attaching those many resources directly to the occurrence).

It is not clear to me what the use cases are that would fit either of those two reasons, but Rich may be able to provide them.
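To make the difference between the two models concrete, here is a small sketch that writes each one as a set of triples and retrieves the evidence resources for an occurrence. dwc:hasEvidence and dwc:Evidence are the proposed (not existing) terms from this discussion; the `ex:` identifiers and the `ex:evidenceResource` property are placeholders, since no property has been named for that second link at this point.

```python
# Sketch comparing the two evidence models as (subject, predicate, object)
# triples. All "ex:" names are placeholders for illustration.
dsw_model = {
    ("ex:occ1", "dwc:hasEvidence", "ex:specimen1"),
    ("ex:specimen1", "rdf:type", "dwc:PreservedSpecimen"),
}

join_model = {
    ("ex:occ1", "dwc:hasEvidence", "ex:evidence1"),
    ("ex:evidence1", "rdf:type", "dwc:Evidence"),
    ("ex:evidence1", "ex:evidenceResource", "ex:specimen1"),
    ("ex:specimen1", "rdf:type", "dwc:PreservedSpecimen"),
}


def objects(triples, subject, predicate):
    return {o for s, p, o in triples if s == subject and p == predicate}


def resources_direct(triples, occurrence):
    """DSW-style: one hop from occurrence to the evidence resource."""
    return objects(triples, occurrence, "dwc:hasEvidence")


def resources_via_node(triples, occurrence):
    """Join-style: occurrence -> evidence node -> evidence resource."""
    found = set()
    for node in objects(triples, occurrence, "dwc:hasEvidence"):
        found |= objects(triples, node, "ex:evidenceResource")
    return found
```

Both functions return the same specimen for `ex:occ1`; the join model costs an extra hop and an extra node, which pays off only if that node needs its own properties or has to fan out to several resources.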

Oct 30 '20 13:10 baskaufs

As far as whether evidence can serve as evidence for an occurrence, or an identification, or both, here is another diagram:

[Diagram: use of evidence]

The model of the main classes (identification, organism, occurrence, event, "taxon-ish thing") is according to Darwin-SW, which pretty much originally sprang out of Rich's brain, so I think he is probably thinking of those classes as having the same connections.

The question is whether a kind of evidence can be linked to an occurrence or an identification or both. In this diagram, the same property, dwc:hasEvidence is used to make both of the links. In contrast, the DSW model has two separate properties: dsw:idBasedOn (for connecting identifications) and dsw:hasEvidence (for connecting occurrences). Whether or not it is better to have the same property or two different ones again depends on the use cases. Having a single property would be less complicated since it would involve creating only one new property instead of two. However, if one's goal is to be able to query, then using the same property would require adding an additional screen to the query pattern (is the subject resource an occurrence or an identification). It is not clear to me which is better. But this brings us to a critical point in standards development: we should not be setting standards by what seems right in our brains, but rather defining use cases and then deciding whether the proposed solutions satisfy them or not.
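The extra "screen" mentioned above can be sketched as follows. The data and function names are hypothetical; the property names are the ones used in this comment (a single proposed dwc:hasEvidence versus the Darwin-SW pair dsw:hasEvidence and dsw:idBasedOn).

```python
# Sketch: finding evidence for occurrences under each property design.
TYPE = "rdf:type"

# One shared property for both occurrences and identifications.
shared = {
    ("ex:occ1", TYPE, "dwc:Occurrence"),
    ("ex:id1", TYPE, "dwc:Identification"),
    ("ex:occ1", "dwc:hasEvidence", "ex:image1"),
    ("ex:id1", "dwc:hasEvidence", "ex:image1"),
}

# Two separate properties, as in Darwin-SW.
separate = {
    ("ex:occ1", "dsw:hasEvidence", "ex:image1"),
    ("ex:id1", "dsw:idBasedOn", "ex:image1"),
}


def occurrence_evidence_shared(triples):
    # Extra step: first work out which subjects are occurrences.
    occurrences = {s for s, p, o in triples
                   if p == TYPE and o == "dwc:Occurrence"}
    return {(s, o) for s, p, o in triples
            if p == "dwc:hasEvidence" and s in occurrences}


def occurrence_evidence_separate(triples):
    # The property itself already says the subject is an occurrence.
    return {(s, o) for s, p, o in triples if p == "dsw:hasEvidence"}
```

Both queries give the same answer on this toy data; the shared-property version simply depends on the subject's type being stated, which is the additional condition the query pattern has to carry.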

There is a separate issue that hasn't come up and that is which direction the arrows should point. In Darwin-SW, we defined inverse properties pointing in both directions, but in retrospect I think that was a mistake since it is a burden to providers to have to figure out how to provide both properties and a burden on queriers to have to design complicated queries in the event that providers don't provide both. (This is why we established "preferred" properties within inverse pairs.) Based on my experience it is much better to have the properties point from the "many" direction to the "one" direction (if it's a one-to-many relationship). Since one occurrence can have many forms of evidence, that would argue towards having the properties point towards the occurrence. However, the same piece of evidence could potentially serve as evidence for several occurrences (e.g. a single image capturing multiple organisms). So that isn't necessarily clear, although if we actually figure out how to annotate parts of images, segments of videos and sound recordings, parts of documents, etc. the granularity of describing the evidence might be such that we could say that generally a single piece of evidence only describes one occurrence.

Oct 30 '20 13:10 baskaufs

Thanks, @baskaufs -- this is VERY helpful! First, full disclosure: our actual implementation looks more like the DSW model than "my" model. And it works very well. However the thing that has bothered me about it is that the "Evidence" table in our RDBMS implementation is a thin stack of identifiers that map 1:1 with identifiers in various other tables (Media, Reference, CollectionObject [=generalized PreservedSpecimen], etc.). In effect, this makes our "Evidence" function as a superclass of all those other things. But that seems like a distorted view of the universe (i.e., Images and Specimens and such have more inherent/intrinsic value in and of themselves than simply serving as evidence of other things). Moreover, we still need a join table to represent the M:M relationship between Evidence and Occurrence (i.e., each instance of Evidence in this sense may underpin multiple Occurrence instances, and conversely each instance of an Occurrence may be supported by multiple instances of Evidence), which we call "OccurrenceEvidence". In exposing the values of OccurrenceEvidence via DwC, I had imagined using dwc:ResourceRelationship.

So... if our actual implementation is more or less the DSW approach, why have I offered the alternate approach in this discussion? Well in part for the reasons described above, as well as a few other reasons, it seems much more appropriate to represent Evidence as the "join" between the Specimen|Image|Publication|Etc. and the Occurrence[|Identification?] instance, because that is the specific context in which the Subject instance (Specimen|Image|Publication|Etc.) actually functions as Evidence for the Object instance (Occurrence[|Identification?]). This is why I opened the door to representing instances of what we want to characterize as Evidence within dwc:ResourceRelationship, typed via a specific value for dwc:relationshipOfResource.

It seems to me, the justification for establishing a Class of anything in the context of DwC (or any data standard) is to be able to represent it through an identifier, and to attach key properties to it. So this leads to: what are the key properties we would want to attach to an instance of dwc:Evidence? In our implementation, we don't really have any meaningful properties associated with our Evidence sensu DSW (i.e., "superclass" of those other things). The key properties are actually within our OccurrenceEvidence table -- things like evidenceQuality and isPrimarySubject.
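A rough sketch of the table layout described above, using an in-memory SQLite database. The table name OccurrenceEvidence comes from this comment; the column names follow the stand-in labels evidenceQuality and isPrimarySubject, while the column types, the "kind" column, and all row values are assumptions made for illustration and do not reflect the actual implementation.

```python
import sqlite3

# In-memory database; schema details beyond the table and property names
# mentioned in the thread are guesses for the sake of a runnable example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Occurrence (occurrenceID TEXT PRIMARY KEY);
CREATE TABLE Evidence (evidenceID TEXT PRIMARY KEY, kind TEXT);
CREATE TABLE OccurrenceEvidence (
    occurrenceID     TEXT REFERENCES Occurrence(occurrenceID),
    evidenceID       TEXT REFERENCES Evidence(evidenceID),
    evidenceQuality  TEXT,
    isPrimarySubject INTEGER,
    PRIMARY KEY (occurrenceID, evidenceID)
);
INSERT INTO Occurrence VALUES ('occ-1'), ('occ-2');
INSERT INTO Evidence VALUES
    ('video-1', 'Media'),
    ('specimen-1', 'CollectionObject');
INSERT INTO OccurrenceEvidence VALUES
    ('occ-1', 'video-1', 'high', 1),
    ('occ-2', 'video-1', 'low', 0),
    ('occ-1', 'specimen-1', 'high', 1);
""")

# Which occurrences does one video support, and is each its primary subject?
rows = conn.execute(
    "SELECT occurrenceID, isPrimarySubject FROM OccurrenceEvidence "
    "WHERE evidenceID = ? ORDER BY occurrenceID",
    ("video-1",),
).fetchall()
```

The point of the sketch is that the descriptive properties sit on the join rows, not on the Evidence rows: the same video is the primary subject for one occurrence and incidental for another.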

As to specific comments & questions from @baskaufs :

> I'm not sure what property would link the evidence instance to the resource that is serving as the evidence.

I would suggest something like isEvidenceOf?

> If there are properties that we want to attach to that node that we can't reasonably attach to one of the adjacent nodes (the occurrence or the resource serving as the evidence).

As I noted above, in our implementation we attach properties like evidenceQuality and isPrimarySubject (not by those names, but that's what they represent).

> If the node is needed to facilitate one-to-many or many-to-many relationships. For example, if a single "evidence" instance of Rich's type needed to be connected to many resources serving as evidence (as opposed to attaching those many resources directly to the occurrence).

Yeah, this is pretty much a given. Certainly each Occurrence instance (or Identification instance... if we go there) can be supported by multiple instances that function as evidence (e.g., five humans observed the bird at the pond, two photos were taken of it, then someone killed it and preserved it at a Museum). But likewise, an image/video or a publication could serve as evidence for multiple Occurrence instances.

> The model of the main classes (identification, organism, occurrence, event, "taxon-ish thing") is according to Darwin-SW, which pretty much originally sprang out of Rich's brain, so I think he is probably thinking of those classes as having the same connections.

Yup! That diagram in your second post works for me!

> we should not be setting standards by what seems right in our brains, but rather defining use cases and then deciding whether the proposed solutions satisfy them or not.

I think this is key, so here are some use cases off the top of my head (I can come up with more):

I have a library of underwater video recordings. Most of the individual recordings focus on a particular Organism (e.g., fish), but the video image also captures many other organisms coming in and out of frame. I would like to document as many Occurrence instances as I can based on this video clip, and ensure that each Occurrence can be traced back to the video clip to serve as the foundation for the Occurrence.
I would like to generate a regional checklist of species that occur within a defined area, and I would like to cite all the evidence to support my assertions that each taxon occurs (or has occurred) at the indicated location. The source of the Occurrence instances include specimens, reported human observations, published distributions, in-situ images, and eDNA samples.
I would like to assess the confidence of a taxonomic identification of an organism based on whether the identification was made with a specimen in hand, or from an image of the organism, or from a DNA sequence, or some combination of these.
I would like to filter a list of Occurrence records based on whether they are supported by preserved specimens, in-situ images, published records, or some combination of these things.
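The last use case above (filtering occurrences by the kinds of evidence behind them) could look something like the sketch below once occurrence-to-evidence links are available. The data, the kind labels, and the function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical (occurrence, evidence kind) links.
links = [
    ("occ-1", "PreservedSpecimen"),
    ("occ-1", "StillImage"),
    ("occ-2", "HumanObservation"),
    ("occ-3", "StillImage"),
]


def kinds_by_occurrence(pairs):
    """Group evidence kinds under each occurrence identifier."""
    kinds = defaultdict(set)
    for occurrence, kind in pairs:
        kinds[occurrence].add(kind)
    return dict(kinds)


def supported_by(pairs, required, match_all=False):
    """Occurrences backed by any of the required kinds, or by all of
    them when match_all is True. Returns a sorted list of identifiers."""
    required = set(required)
    if match_all:
        keep = required.issubset
    else:
        keep = lambda kinds: bool(required & kinds)
    grouped = kinds_by_occurrence(pairs)
    return sorted(occ for occ, kinds in grouped.items() if keep(kinds))
```

For example, `supported_by(links, {"StillImage"})` selects occurrences with at least one image, while passing `match_all=True` with both "PreservedSpecimen" and "StillImage" keeps only occurrences that have a specimen and an image together.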

There's a LOT to unpack here, and I'm not sure if my rantings are in any way helpful to what this Issue was created for, but I do know that dwc:basisOfRecord does not, by itself, allow me to track the kind of information I would like to track (and share it and/or harvest it from an aggregator in the way I would like to be able to access it).

As with so many of these discussions, it's important to separate implementation-specific things from things that are genuinely helpful/important in a data exchange standard like DwC. Also, it's important to focus on actual user needs and actual available data. When I first saw the DSW model, I was satisfied that the first bar had been reached (i.e., it was clear that this idea wasn't restricted to our own implementation). The reason I have stayed quiet on the Evidence class is mostly with respect to the second bar: how much data exists that can actually be represented with this degree of granularity, and who really has a use for it? In my mind that bar was reached via several conversations at both recent TDWG conferences (especially, but not only, discussions related to dwc:basisOfRecord). In other words, I think we may be close to critical mass on a minimum threshold of available data and expressed need that it may be time for this community to "go there" with respect to Evidence.

I don't know about others, but this is exactly the kind of discussion I was hoping would emerge from this. My main fear is that the only ones who find this discussion important and worthwhile are @baskaufs and I.

Oct 30 '20 20:10 deepreef

I find the discussion important and useful, though I do not have the time to contribute much. I can't help thinking about BCO as the discussion progresses. I would feel much more confident modeling the necessary relationship in an ontology and testing the reasoning for rigor than to try to "fit" Darwin Core with Evidence as an initial goal.

On Fri, Oct 30, 2020 at 5:57 PM Richard L. Pyle [email protected] wrote:

Thanks, @baskaufs https://github.com/baskaufs -- this is VERY helpful! First, full disclosure: our actual implementation looks more like the DSW model than "my" model. And it works very well. However the thing that has bothered me about it is that the "Evidence" table in our RDMS implementation is the thin stack of identifiers that map 1:1 with identifiers in various other tables (Media, Reference, CollectionObject [=generalized PreservedSpecimen], etc.). In effect, this makes our "Evidence" function as a superclass of all those other things. But that seems like a distorted view of the universe (i.e., Images and Specimens and such have more inherent/intrinsic value in and of themselves than simply serving as evidence of other things). Moreover, we still need a join table to represent the M:M relationship between Evidence and Occurrence (i.e., each instance of Evidence in this sense may underpin multiple Occurrence instances, and conversely each instance of an Occurrence may be supported by multiple instances of Evidence), which we call "OccurrenceEvidence". In exposing the values of OccurrenceEvidence via DwC, I had imagined using dwc:ResourceRelationship.

So... if our actual implementation is more or less the DSW approach, why have I offered the alternate approach in this discussion? Well in part for the reasons described above, as well as a few other reasons, it seems much more appropriate to represent Evidence as the "join" between the Specimen|Image|Publication|Etc. and the Occurrence[|Identification?] instance, because that is the specific context in which the Subject instance (Specimen|Image|Publication|Etc.) actually functions as Evidence for the Object instance (Occurrence[|Identification?]). This is why I opened the door to representing instances of what we want to characterize as Evidence within dwc:ResourceRelationship, typed via a specific value for dwc:relationshipOfResource.

It seems to me, the justification for establishing a Class of anything in the context of DwC (or any data standard) is to be able to represent it through an identifier, and to attach key properties to it. So this leads to: what are the key properties we would want to attach to an instance of dwc:Evidence? In our implementation, we don't really have any meaningful properties associated with our Evidence sensu DSW (i.e., "superclass" of those other things). The key properties are actually within our OccurrenceEvidence table -- things like evidenceQuality and isPrimarySubject.

As to specific comments & questions from @baskaufs https://github.com/baskaufs :

I'm not sure what property would link the evidence instance to the resource that is serving as the evidence.

I would suggest something like isEvidenceOf?

If there are properties that we want to attach to that node that we can't reasonably attach to one of the adjacent nodes (the occurrence or the resource serving as the evidence).

As I noted above, in our implementation we attach properties like evidenceQuality and isPrimarySubject (not by those names, but that's what they represent).

If the node is needed to facilitate one-to-many or many-to-many relationships. For example, if a single "evidence" instance of Rich's type needed to be connected to many resources serving as evidence (as opposed to attaching those many resources directly to the occurrence).

Yeah, this is pretty much a given. Certainly each Occurrence instance (or Identification instance... if we go there) can be supported by multiple instances that function as evidence (e.g., five humans observed the bird at the pond, two photos were taken of it, then someone killed it and preserved it at a Museum). But likewise, an image/video or a publication could serve as evidence for multiple Occurrence instances.

The model of the main classes (identification, organism, occurrence, event, "taxon-ish thing") is according to Darwin-SW, which pretty much originally sprang out of Rich's brain, so I think he is probably thinking of those classes as having the same connections.

Yup! That diagram in your second post works for me!

we should not be setting standards by what seems right in our brains, but rather defining use cases and then deciding whether the proposed solutions satisfy them or not.

I think this is key, so here are some use cases off the top of my head (I can come up with more):

1. I have a library of underwater video recordings. Most of the individual recordings focus on a particular Organism (e.g., fish), but the video image also captures many other organisms coming in and out of frame. I would like to document as many Occurrence instances as I can based on this video clip, and ensure that each Occurrence can be traced back to the video clip to serve as the foundation for the Occurrence.

2. I would like to generate a regional checklist of species that occur within a defined area, and I would like to cite all the evidence to support my assertions that each taxon occurs (or has occurred) at the indicated location. The sources of the Occurrence instances include specimens, reported human observations, published distributions, in-situ images, and eDNA samples.

3. I would like to assess the confidence of a taxonomic identification of an organism based on whether the identification was made with a specimen in hand, or from an image of the organism, or from a DNA sequence, or some combination of these.

4. I would like to filter a list of Occurrence records based on whether they are supported by preserved specimens, in-situ images, published records, or some combination of these things.
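The last use case (filtering by kind of supporting evidence) could be sketched roughly as follows. This assumes evidence is available as simple (occurrence, kind) links; the identifiers and kind labels are hypothetical and not drawn from any existing vocabulary.

```python
from collections import defaultdict

# (occurrence_id, evidence_kind) links; all values are hypothetical.
evidence_links = [
    ("occ:1", "PreservedSpecimen"),
    ("occ:1", "InSituImage"),
    ("occ:2", "PublishedRecord"),
    ("occ:3", "InSituImage"),
]


def kinds_by_occurrence(links):
    """Collect the set of evidence kinds attached to each occurrence."""
    kinds = defaultdict(set)
    for occurrence_id, kind in links:
        kinds[occurrence_id].add(kind)
    return kinds


def supported_by_all(links, required):
    """Occurrences backed by every one of the required evidence kinds."""
    required = set(required)
    return sorted(o for o, k in kinds_by_occurrence(links).items() if required <= k)


def supported_by_any(links, accepted):
    """Occurrences backed by at least one of the accepted evidence kinds."""
    accepted = set(accepted)
    return sorted(o for o, k in kinds_by_occurrence(links).items() if accepted & k)
```

A single basisOfRecord value per row cannot express the "some combination of these" part, which is why the sketch keeps a set of kinds per occurrence.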

There's a LOT to unpack here, and I'm not sure if my rantings are in any way helpful to what this Issue was created for, but I do know that dwc:basisOfRecord does not, by itself, allow me to track the kind of information I would like to track (and share it and/or harvest it from an aggregator in the way I would like to be able to access it).

As with so many of these discussions, it's important to separate implementation-specific things from things that are genuinely helpful/important in a data exchange standard like DwC. Also, it's important to focus on actual user needs and actual available data. When I first saw the DSW model, I was satisfied that the first bar had been reached (i.e., it was clear that this idea wasn't restricted to our own implementation). The reason I have stayed quiet on the Evidence class is mostly with respect to the second bar: how much data exists that can actually be represented with this degree of granularity, and who really has a use for it? In my mind that bar was reached via several conversations at both recent TDWG conferences (especially, but not only, discussions related to dwc:basisOfRecord). In other words, I think we may be close to critical mass on a minimum threshold of available data and expressed need that it may be time for this community to "go there" with respect to Evidence.

I don't know about others, but this is exactly the kind of discussion I was hoping would emerge from this. My main fear is that the only ones who find this discussion important and worthwhile are @baskaufs and I.

Oct 30 '20 23:10 tucotuco

I think @tucotuco is probably right -- we'd need to collectively understand (and agree on) the ontology of this stuff before it makes sense to flesh out a DwC class. A year ago I would have said the ontological representation is pretty solid (i.e., DSW model), but as evidenced by the discussion here, it still needs some clarification (in my own mind, at least).

Edit: "evidenced" not intended as pun, but kinda works out that way.

Oct 30 '20 23:10 deepreef

I enjoyed this thread a lot! Not sure if helpful, but here is a fringe case question that got stuck in my mind.

What happens if two people co-collect/co-observe a dwc:Occurrence together and both are listed together in dwc:recordedBy? We normally record this as one single species dwc:Occurrence - assigning one single dwc:occurrenceID. But if we for some reason want to talk about the observation of the species-occurrence (dwc:Occurrence) by each person separately, would this create two new dwc:Occurrences? (In total three dwc:occurrenceIDs - I'd guess maybe not?) Or might there maybe be a use case for two (or three?) instances of an Evidence here??? [Related - Is the person/Agent needed to make a species-occurrence into a dwc:Occurrence, would it be a dwc:Occurrence if nobody recorded it? (provided Evidence)]

Next, what happens if a third person, maybe not a naturalist at all, maybe e.g. a journalist, or an amateur photographer, or another researcher, observes the two collectors (dwc:recordedBy/Agents) making the species observation (dwc:Occurrence?) together? (Maybe publishing the photograph in a newspaper, or a scientific paper). Might even this case generate another dwc:Occurrence, with another dwc:occurrenceID (I'd guess not ??), or might this maybe generate another instance of Evidence??? (Maybe simply an Image made to have the role as Evidence by a hasEvidence statement?) [Might this be similar in any way to the use case of the literature-occurrences that Plazi and BHL work with ??]

+1 @tucotuco for thinking of BCO, however, also +1! @deepreef for basisOfRecord overloaded! and the need for doing something

Oct 31 '20 00:10 dagendresen

Forgive my ignorance in the nuance that is being expressed here, but this feels like turtles all the way down. It reminds me of how Crossref in its early days wrestled with WHAT gets a DOI. At that time publishers were exploring with multiple digital outputs, views, and file formats of scientific articles beyond landing pages and PDFs. They finally settled on a rule: "A Crossref DOI should point to one intellectually discrete scholarly document". And, Crossref then developed a suite of tools and services (coincidentally reused to detect plagiarism) to police the assignment of DOIs to duplicative content. Bad behaviour had real financial consequences for members and could mean getting tossed from the playground. The point here is that Crossref and its members settled on a precise definition for the entities in the collective sandbox & then got on with their business. It's not entirely fair to compare our entities to scientific papers that come ready-made with tidy little citation graphs, but there's also a message here to be ever mindful of how we expect our equivalent "intellectually discrete scholarly documents" to be used and linked. If there is confusion over WHAT are our grains of sand when drawn into the aggregate then we have a very real problem; it does not instil confidence and the citation graph will be a dull mat of cold mud.

Oct 31 '20 01:10 dshorthouse

@dagendresen : We have a similar conundrum in our world: A team of three divers go on a dive together. They stay pretty close to each other during the dive. Each has his/her own video camera, and records a whole bunch of video clips during the dive. Because the divers are close together, they often capture video images of the same individual organisms (e.g., rare fish, big shark swims by, etc.). A not-uncommon circumstance is that two of the divers are filming the rare fish/shark, and the third diver is filming the other two divers filming the rare fish/shark. Scenarios such as this are more the norm than they are the exception in our real-life world.

The most vexing issue for us isn't even about the Occurrences (that's relatively straightforward -- see below). The thing we argue about is: How many Events are in play here? Our Event model is hierarchical, so we have one top-level event for the "Expedition". We then typically generate a "Team Dive" event as a child of the "Expedition" Event. From there, things get squirrelly. In most cases, we don't define any more granular events than this. The main problem is that properties like minimumDepthInMeters/maximumDepthInMeters need to be attached to the Occurrence instance (rather than where they really belong, which is the Location instance -- or at least the Event instance), so we can capture accurate/precise depth values for each Occurrence established during the dive. If we want to avoid that problem, then we need to define lots of sub-events for each "Team Dive", so that the depth values can be correctly assigned. The problem, though, is a massive inflation of Event instances (at least), or Location instances -- because we know the depth with precision, and therefore effectively every video clip becomes its own Event (and Location). Another way to parse out subevents is to create three "Person Dives" as child events to the parent "Team Dive", such that each diver's experience and set of documented occurrence instances represent a distinct Event, separate from the other divers' Event-experiences. And then things get really complicated when we try to parse events along both pathways, in which case there are potentially hundreds of events on a single dive.

That was a digression from your questions, but I wanted to explain that the definition of an Event comes into play here as well (and also the question of whether properties like minimumDepthInMeters/maximumDepthInMeters apply to the Occurrence, Event, or Location instance).
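As a rough illustration of the trade-off (depth stored on the Occurrence versus on the Event), the sketch below assumes a hierarchical Event with an optional depth range and an Occurrence that may carry its own depth as a shortcut. The class and field names are hypothetical, and the depth values are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DepthRange = Tuple[float, float]  # (minimumDepthInMeters, maximumDepthInMeters)


@dataclass
class Event:
    event_id: str
    parent: Optional["Event"] = None
    depth_range_m: Optional[DepthRange] = None


@dataclass
class Occurrence:
    occurrence_id: str
    event: Event
    depth_range_m: Optional[DepthRange] = None  # the "cheat": depth on the Occurrence


def resolve_depth(occ: Occurrence) -> Optional[DepthRange]:
    """Prefer the occurrence's own depth, else the nearest ancestor Event that has one."""
    if occ.depth_range_m is not None:
        return occ.depth_range_m
    event = occ.event
    while event is not None:
        if event.depth_range_m is not None:
            return event.depth_range_m
        event = event.parent
    return None


expedition = Event("expedition")
team_dive = Event("team-dive", parent=expedition, depth_range_m=(10.0, 120.0))
person_dive = Event("person-dive-1", parent=team_dive)

precise = Occurrence("occ:fish", person_dive, depth_range_m=(95.0, 97.0))
coarse = Occurrence("occ:coral", person_dive)
```

Keeping the precise depth on the Occurrence avoids one sub-event per video clip, at the cost of putting a Location/Event property in the wrong place.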

So getting back to your question, let's ignore how we parse the Event, and focus on the three divers, each with their own video cameras. Two of them find a rare fish and film it, while the third diver films the first two divers filming the rare fish. The way we capture this in the current implementation is as follows (simple form):

1 Event (however parsed it is)
1 Organism (rare fish)
1 Occurrence (Intersection of Organism and Event)
3 Media files (video clips from each of the divers -- you can see the fish in the third diver's video)

In the DSW model, the 3 Media files = 3 Evidence instances -- which is how our current implementation works. But in my current thinking, there are three Media Files which are media files, and three separate instances of "Evidence", one from each Media file linked to the 1 Occurrence (i.e., three records in a join table between Media instance and Occurrence instance).

Now, let's review how we actually capture this in the current implementation:

1 Event (however parsed it is)
4 Organisms (one rare fish, three humans)
4 Occurrences (one for each of the four Organisms at this Event)
3 Media files (video clips from each of the divers)

In the existing (DSW-like) implementation, we have 3 Evidence instances -- representing each of the 3 video files (as above). However, in the model representing my current thinking, we have at least 5, and perhaps 6 instances of Evidence:

Intersection of first diver's video of rare fish
Intersection of second diver's video of rare fish
Intersection of third diver's video of rare fish
Intersection of third diver's video with first diver
Intersection of third diver's video with second diver
Presence of third diver at Event, evidenced either via recordedBy (per our earlier discussion on Agent As Organism), or via HumanObservation.
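The tally above can be checked with a small sketch. The identifiers are hypothetical; the only claim is the counting: three media files, five media-to-occurrence joins, and a sixth evidence instance for the third diver, who appears in no video.

```python
# Hypothetical identifiers for the three-diver scenario.
occurrences = {"fish", "diver1", "diver2", "diver3"}

# Media-based evidence as (media file, occurrence it shows) joins.
media_evidence = [
    ("video1", "fish"),
    ("video2", "fish"),
    ("video3", "fish"),
    ("video3", "diver1"),
    ("video3", "diver2"),
]

# The third diver is not filmed; presence rests on recordedBy or a HumanObservation.
other_evidence = [("recordedBy", "diver3")]

all_evidence = media_evidence + other_evidence
media_files = {media for media, _ in media_evidence}
evidenced_occurrences = {occ for _, occ in all_evidence}
```

Counting media files gives 3 (the DSW-like view), while counting joins gives 5 or 6, and every one of the four Occurrences ends up with at least one line of evidence.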

Now, my answers to your specific questions:

But if we for some reason want to talk about the observation of the species-occurrence (dwc:Occurrence) by each person separately, would this create two new dwc:Occurrences?

No. The Occurrence is the intersection of the Organism and the Event; so only one Occurrence no matter how many observers or pieces of evidence.

Or might there maybe be a use case for two (or three?) instances of an Evidence here???

Potentially. I don't think we would bother parsing out each person's observation as a separate piece of evidence; but I guess you certainly could if you wanted to.

[Related - Is the person/Agent needed to make a species-occurrence into a dwc:Occurrence, would it be a dwc:Occurrence if nobody recorded it? (provided Evidence)]

In the absence of any evidence, how would we ever know to generate the data record? (If a tree falls in the woods it does make a sound, but someone needs to document that sound in order to record the circumstance of its falling.)

Please note: the scenarios I described above are NOT edge cases -- in fact, they are representative of the majority of our data related to video-based occurrence records. As I was typing all that, I was thinking about how to show an example, and one occurred to me: https://www.youtube.com/watch?v=3fI2QxUAv1g. That one is actually almost perfect for the hypothetical presented by @dagendresen . We have three naturalists gathering biological data, one in the form of video (John Earle) and two collecting specimens (Richard Pyle and Brian Greene). We also have a journalist (Bob Cranston; BBC cameraman) and his assistant holding the lights (Peter Kraugh). John Earle's video is focusing on the non-human organisms, whereas Bob Cranston's video is capturing both the human organisms and the non-human organisms.

I don't remember whether we ever processed this particular set of video clips, but there are potentially hundreds of Organisms (if you count all the corals and fishes), five humans, and several lines of evidence (videos, observations, collected specimens). Interestingly, I think that at least three of the collected specimens in that series of videos ended up as Holotypes (for Chromis abyssus, Prognathodes geminus, and Tosanoides annepatrice). It might be fun to use this set of video clips to explore how we would capture all the relevant information, so that if I asked the question "What species of fishes live on deep coral reefs in Palau?", I would be able to include all these lines of evidence as the foundation for Occurrence instances to build my checklist.

It would also be fun to explore other questions, like: How do the video clips function as Evidence of Identification as well as Evidence of Occurrence? Does the audio portion of the video (e.g., my helium voice proclaiming "Prognathodes" and "abei") count as separate pieces of evidence-of-identification from the image captured in the video? If we extract a frame from the video and publish it as a still image (as we have), does that count as a separate piece of evidence? It is a separate media file, after all - even though it's contained within the "parent" video media file. And here's a good one: Surely this represents a legitimate example of evidence-of-identification. But should it also serve as an instance of evidence-of-occurrence? What if instead of a fin clip, the sequence was obtained from an eDNA analysis of a water sample taken at the same event?

My head is about to explode, so I'd better stop here.

Oct 31 '20 02:10 deepreef

@deepreef When asking if the Agent is needed to make an Occurrence, I had in my mind (the misconception?) that the Occurrence was the intersection of the organism at a place and time (Event) and when recorded by an observer (what-organism, where-location, when-eventTime, whom-recordedBy). You teach me here that the whom-recordedBy is not part of this scenario?

At the GBIF nodes meeting in Portugal I was playing with this together with a colleague. We both took a photo (for iNaturalist) of the very same butterfly larva eating the very same brassica plant at the very same time (we counted down before pressing the camera button). We wanted to play with the idea of whether this created one or two Occurrences.

My Occurrence: https://www.inaturalist.org/observations/3031052
My colleague's Occurrence: https://www.inaturalist.org/observations/3029206

We tied them together by attempting to machine-tagging them with the same eventID. In light of your model of the Occurrence as the intersect of the Organism and the Event, I have learnt now that these two (instances of Evidence?) are for the same Occurrence?

The origin of our thought game was at the time also in part that different observers might not even be aware of the other declaring a dwc:Occurrence for the shared species-observation. In the real world, we imagined a large group of bird-watchers flocking to a site where a rare bird had been reported. (In Norway I learnt there is an SMS message service that bird-watchers subscribe to and that they might travel far to watch a bird). Maybe each bird-watcher will declare their own dwc:Occurrence for the same bird in a given citizen science platform to photo-voucher the evidence for including the bird species on their individual list of bird species they have seen. In this competitive bird-watching space, would we instruct them that ONLY the first reported sighting of a bird counts (as the Occurrence)? Are a few seconds/minutes/hours between recordings anyway sufficient to count as distinct Occurrences? In my mind, I thought that the different observer in recordedBy by itself was sufficient :-)

The other scenario with your BBC cameraman and the potential hundreds of organisms recorded in the video, that COULD be parsed out -- makes me think of the 2014 GBIF Ebbe Nielsen winner Vijay Barve investigating if images shared on social media such as Flickr, Facebook, etc. could be untapped sources of occurrence-data.

[Apropos your digression example -- Is it the depth of the video-camera or the inferred depth of the organism in the occurrence that is the most appropriate attribute value here? Maybe even both? However, do you always need to explicitly declare all Location nodes just because you have precise depth measurements? Even if these distinct Locations are evident from having the depth reading. It is indeed cumbersome but possible to talk about these Locations eg. as "the location associated with the MeasurementOrFact with measurementOrFactID = urn:uuid:nnn".]

I hope this is still useful for the topic of the thread on basisOfRecord and Evidence. My main interest in the thread is rather (than the above) the distinction between the classes in the basisOfRecord vocabulary -- and in particular how to describe specimens as MaterialSample and as Evidence for an Occurrence.

Oct 31 '20 08:10 dagendresen

... so if the Occurrence is the intersection between the Organism and Event, then we do need something else such as a new Evidence class for all the real-world things that have occurrenceID today??

... would maybe (not saying I think it is) an alternative possibly be a new class OrganismEvent (OrganismOccurrence, SpeciesOccurrence or similar) and renaming Occurrence to Evidence or OccurrenceEvidence (...). [Because Occurrence maybe is misused (?) for very many of the things that currently have occurrenceID assigned?]

Apropos multiple occurrences in the same photo/video (Evidence) -- the same museum specimen - the thing with a catalogNumber - can also be the evidence of multiple Occurrences, when two plants are mounted on the same paper (to save paper) ... or when we start to extract DNA evidence of microorganisms on/inside the specimens, or collect Salmon louse from the ichthyology fish collection [2] that next are accessioned with their own catalog numbers --- and thus PreservedSpecimen as Occurrence does not work here??!

Oct 31 '20 09:10 dagendresen

@dagendresen :

When asking if the Agent is needed to make an Occurrence, I had in my mind (the misconception?) that the Occurrence was the intersection of the organism at a place and time (Event) and when recorded by an observer (what-organism, where-location, when-eventTime, whom-recordedBy). You teach me here that the whom-recordedBy is not part of this scenario?

I'm not sure I follow. In my mind, the Occurrence is the intersection of the Event+Organism. So maybe a better way to answer your question is: Gazillions of Occurrences exist every moment, but only a tiny subset of them get into our databases -- and in most cases, that tiny subset corresponds to the ones where: 1) an Agent was present; and 2) the Agent documented the Occurrence in a form that finds its way into our databases (and thereby gets issued an occurrenceID). (This assumes Machines can count as Agents, and does not take into account the side discussion we had about Agents as Organisms.)

recordedBy is certainly an important property of an Occurrence, but I wouldn't call it a definitive one. Definitive = Event+Organism. If we take Event as "Where+When", and Organism as "What", then we have Occurrence=Where+When+What. The "ByWhom" part is important, but not definitive (and perhaps plays more into Evidence).

In your butterfly example, I would consider it to be one (non-human) Occurrence (Where+When+What, with the "What" being the butterfly). You now have two instances of Evidence (two photos). Or four if you want to add the two HumanObservations (kind of redundant, but still Evidence). And if you killed the butterfly and preserved it in a Museum, you could add a fifth instance of Evidence (PreservedSpecimen).
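The butterfly example can be sketched as a grouping problem, assuming the Occurrence is keyed by Event plus Organism and that recordedBy does not take part in the key. The record layout and all identifiers below are hypothetical.

```python
from collections import defaultdict

# Each report is (event_id, organism_id, recorded_by, evidence); values are hypothetical.
reports = [
    ("event:1", "butterfly:1", "observer A", "photo A"),
    ("event:1", "butterfly:1", "observer B", "photo B"),
]


def group_by_occurrence(reports):
    """Group evidence under the (Event, Organism) pair that defines the Occurrence."""
    grouped = defaultdict(list)
    for event_id, organism_id, _recorded_by, evidence in reports:
        grouped[(event_id, organism_id)].append(evidence)
    return dict(grouped)


occurrences = group_by_occurrence(reports)
```

Two reports by two observers collapse to one Occurrence with two instances of Evidence. Adding recordedBy to the key would instead yield two Occurrences, which is the alternative reading raised in the question.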

In light of your model of the Occurrence as the intersect of the Organism and the Event, I have learnt now that these two (instances of Evidence?) are for the same Occurrence?

That's how I would model it, yes. Or, I guess I should say, that's how our current implementation models it. As per our side conversation, I might be more inclined to model it as three Occurrences -- two human and one non-human Organisms intersecting at the same Event. It would have been great if you were on opposite sides of the butterfly such that your image-based evidence captured each other in the frame as well as the Butterfly!

The origin of our thought game was at the time also in part that different observers might not even be aware of the other declaring a dwc:Occurrence for the shared species-observation.

Indeed, this is something we deal with not infrequently in the real world as well. Two divers at the same place and time each record the same fish with their respective cameras, but not at the same moment (e.g., one on the way down, and the other on the way back up). By default, the fish is assumed to be a different organism for each of the two video clips. However, we sometimes discover that both divers captured video of the same individual fish, in which case we collapse the two organisms as the same, and usually our events are defined broadly enough to be the same as well, which means that the two occurrences also collapse as the same. If we decide to separate the two divers' dives into separate events (as mentioned previously), then we still collapse the Organism instance into one, but the Occurrences end up as separate (Same "What", but possibly different "Where" and/or "When").

In your bird scenario, the real-world problem is that these DO get reported as multiple distinct Occurrences, which can give the false impression to the data consumer that 20 different individuals of the same rare bird occurred at the same place and (roughly) the same time. It would be nice in such cases to have a global mechanism to collapse the dwc:individualID value (= identifier for the Organism instance) so that it isn't misleading in the aggregated data. Whether or not the Events are also collapsed into one (resulting in a single Occurrence instance) depends on how granular one wants to be in defining Event boundaries (dwc:eventTimeUncertaintyInSeconds, anyone?)
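One way such a collapsing mechanism might look, purely as a sketch: a "same individual" mapping from reported identifiers to a canonical one, applied before counting. No such global mechanism exists in Darwin Core today; the mapping, the record layout, and the identifiers here are all assumptions for illustration.

```python
# Three independent reports of what turns out to be the same rare bird (hypothetical IDs).
reports = [
    {"occurrenceID": "occ:1", "individualID": "bird:a"},
    {"occurrenceID": "occ:2", "individualID": "bird:b"},
    {"occurrenceID": "occ:3", "individualID": "bird:c"},
]

# Hypothetical assertions that some identifiers refer to the same Organism.
same_as = {"bird:b": "bird:a", "bird:c": "bird:b"}


def canonical_id(individual_id, same_as):
    """Follow same-as links to a canonical identifier, guarding against cycles."""
    seen = set()
    while individual_id in same_as and individual_id not in seen:
        seen.add(individual_id)
        individual_id = same_as[individual_id]
    return individual_id


def count_individuals(reports, same_as):
    """Number of distinct Organisms after collapsing identifiers."""
    return len({canonical_id(r["individualID"], same_as) for r in reports})
```

Without the mapping the three reports look like three birds; with it they count as one Organism, while the three occurrenceIDs remain available as separate lines of evidence.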

In any case, one of the main reasons why we (and, I assume @baskaufs and others in DSW context) recognized the need for a "Token"/"Evidence" class was to deal with exactly this issue -- i.e., that there can often be multiple lines of Evidence to document the same Occurrence instance.

makes me think of the 2014 GBIF Ebbe Nielsen winner Vijay Barve investigating if images shared on social media such as Flickr, Facebook, etc. could be untapped sources of occurrence-data.

Not only could they be, they absolutely are! About 10 years ago, Rob Whitton and I conceived a plan to build a crowdsourcing platform on Explorer's Log to do exactly this sort of thing, but we never followed through. We may yet, though...

Is it the depth of the video-camera or the inferred depth of the organism in the occurrence that is the most appropriate attribute value here?

Technically (and in the ideal scenario), it's the depth of the dive computer on the diver's rebreather, time-synched with the timestamp on the video camera. But we assume +/- a couple meters, and the diver is usually horizontal to, and within a couple meters of, the subject. In rare cases where there is a meaningful difference, we estimate and record the depth of the Organism, not the depth of the diver (unless the diver is the Organism...)

However, do you always need to explicitly declare all Location nodes just because you have precise depth measurements?

No. In fact, we usually don't. That's why we "cheat" and record the depth at the Occurrence. Rob and I argue about this a lot -- I want to at least push it to Event (if not Location), but Rob doesn't want to populate a gazillion nearly identical Event (& Location) records to the point where they approach 1:1 with Occurrences. My counterpoint to him is that if we do ever extract those hundreds of "other" Occurrences from all those video clips, we will no longer suffer a near-1:1 ratio of Occurrence & Event (or even Location). As an aside, we've decided internally that -- for now at least -- "Location" describes a two-dimensional footprint on the surface of the earth, and any depth/elevation values (z-axis, 3rd dimension) are properties of the Event, not the Location. Yet another topic for another thread.

... so if the Occurrence is the intersection between the Organism and Event, then we do need something else such as a new Evidence class for all the real-world things that have occurrenceID today??

This gets at the heart of why I've been thinking about this for more than a decade, but am only making noise about it now. I don't know if the TDWG community is "There" yet. We do progress over time (we're a LONG way from where we were in the early days of DiGIR). But if you try to push things too hard/too fast, they sometimes break. We'll see if the discussion on Evidence as a Class takes root this time, or needs to go back into hiding for another few years or a decade or so.

... would maybe (not saying I think it is) an alternative possibly be a new class OrganismEvent (OrganismOccurrence, SpeciesOccurrence or similar) and renaming Occurrence to Evidence or OccurrenceEvidence (...). [Because Occurrence maybe is misused (?) for very many of the things that currently have occurrenceID assigned?]

I would regard that as the greater of two evils.

Apropos multiple occurrences in the same photo/video (Evidence) -- the same museum specimen - the thing with a catalogNumber - can also be the evidence of multiple Occurrences, when two plants are mounted on the same paper (to save paper) ... or when we start to extract DNA evidence of microorganisms on/inside the specimens, or collect Salmon louse from the ichthyology fish collection [2] that next are accessioned with their own catalog numbers --- and thus PreservedSpecimen as Occurrence does not work here??!

Exactly. MANY examples exist where one MaterialSample instance includes multiple Organisms. What we call a "Specimen" is vague, but even in the traditional sense, parasites are an obvious example (until they are removed from the host and cataloged separately).

I'm going to assume at this stage that @dagendresen and I are the only ones actually following this discussion, and I therefore apologize to everyone else. But all this stuff needs to be discussed somewhere, some time, and at some point, and it is directly tied to the "Problem" of dwc:basisOfRecord. Maybe this is not the right time or place (=Event) to have this in-depth discussion. But I wouldn't be spending a non-trivial part of my Saturday morning banging away at it, if I didn't think it was (ultimately) important for our community.

Oct 31 '20 20:10 deepreef

Many thanks for engaging!! This thread is very educational for me!

In my example when we, at the time, were thinking of what=Taxon (=scientificName) + where=Location (=decimalLatitude+decimalLongitude) + when=eventTime (or eventDate) + agent=recordedBy as the immutable "data" that decided what was the same Occurrence (to be identified by the same occurrenceID) our thinking was MUCH less complex than your thinking!! And also rather influenced by trying to make sense of how we observed the concept of Occurrence was applied and used for real-world datasets [more bottom-up from data and less top-down from ontological thinking].

Nov 01 '20 10:11 dagendresen

I wish to return to my very first experience of problems using basisOfRecord. This was when trying to link seedbank collections data to GBIF (for me starting from back in 2004). At this time there were no basisOfRecord = LivingSpecimen or basisOfRecord = MaterialSample yet. However, also later I have always found LivingSpecimen to be much more suitable to botanical garden specimens than to seedbank "specimens" or Accessions as the seedbanks normally call them. And there are many more possibilities with MaterialSample.

Brief summary of this use case:

For originally in situ wild or on-farm source material (1) seeds are collected in situ in the wild or "in situ"/on-farm from a regionally localized traditional farming context. This material aligns well with the Darwin Core concepts; and in situ/on-farm collected seeds as dwc:Occurrence works fine (except from some missing terminology addressed in the Darwin Core Germplasm extension, https://doi.org/10.17161/bi.v8i1.4095 & https://doi.org/10.13140/2.1.1207.3923).

The Bioversity collecting mission database (https://doi.org/10.15468/ulk1iz) holds examples of such in situ and on-farm material.

Next (2) the collected seed material is multiplied through seed multiplication ex situ (grown on lands at agricultural field stations and new more numerous seeds harvested) and included in a seedbank -- not way too different from museum collections in function. These seedbank seed samples as PreservedSpecimen or LivingSpecimen (or rather MaterialSample) is more or less reasonably acceptable.

The European Genetic Resource Search Catalog (EURISCO) (https://doi.org/10.15468/a3lnmd) holds examples of seed bank accessions.

BUT next (3) seed samples are distributed to other parties and very often also to other seedbanks. These seed samples are assigned other "catalog numbers". Other public seedbanks assign new accession numbers and unfortunately too often lose the provenance link to the parent seed sample material (more often because of unreliable material identifiers than the lack of trying). Private crop breeding companies assign breeding-line numbers and start a genetic selection for a reduction of genetic diversity to fit agricultural needs - and at the same time also an increase in genetic diversity by crossing with other breeding lines. Differences both between different seedbank accessions and also against breeding-lines are of vital importance here. The identity of these derived material seed samples is ultimately MUCH more important here than the link to the original source material that is more appropriately representing the Occurrence concept!

The UN FAO ITPGRFA Global Information System (https://ssl.fao.org/glis/) holds examples of seed distribution for public seedbank accessions. [All public seedbank material distributed is identified by a machine-readable DOI, each time it is distributed]

Furthermore, when (4) seedbank Accessions and the breeding-lines result in a new (commercial -- or public pre-breeding) cultivar, seed material from the cultivars/varieties enters public seedbanks when licensing periods end and/or cultivars are no longer in the market for sale. And thus it is further used again as a new raw material in breeding programs towards yet another new (commercial) cultivar. [Something new is clearly created here that is no longer the same as the source thing identified by the original occurrenceID]

If the seed MaterialSamples were to simply be Occurrences of type basisOfRecord = PreservedSpecimen or LivingSpecimen then the full line of parent-child descendants from the in situ/on-farm source material down to the seed sample Accessions and breeding-lines would share the same occurrenceID identity???? [At this time there was NO MaterialSample and no materialSampleID in Darwin Core, which might help a lot!!]

This is my rationale, and why I have a huge problem with accepting "Specimens" as Occurrences - or rather seed material samples as Occurrences.

@deepreef, sorry to just throw out another complex use case. But thought it might be useful to declare my actual primary interest in basisOfRecord issues and thus my primary interest in this thread. [PS: I would not have the conscience to go so deep in this thread during charged working hours at the museum - the weekends are my window for this type of fun]

Nov 01 '20 12:11 dagendresen

@dagendresen : Thanks for the detailed use case! I've never considered a use case like this before, so I found it helpful to see how well my own thinking of these various entities/classes work when modelling a novel (to me) situation as you describe.

I think I understand your description to involve several generations of the plants -- correct? In other words, Material (1) seeds from wild/on-farm are collected at Event (Location+Time) 1. Collectively, the seeds represent an instance of "Organism" (which accommodates more than one individual, when appropriate), and their presence at the collection Event constitutes an Occurrence. I'll call these Organism1, Event1 and Occurrence1.

If I understand correctly, these same seeds are moved to a different location at a different time at (2), which means the same Organism1 intersects with a new Event (Location+Time; Event2), and hence yields Occurrence2. Correct?

This next step is where I'm a bit unclear. Do I correctly interpret this part:

collected seed material is multiplied through seed multiplication ex situ (grown on lands at agricultural field stations and new more numerous seeds harvested) and included in a seedbank

...to mean that the originally collected seeds (Organism1) are germinated and grown and bear new seeds of their own, and those new seeds are then harvested for distribution to a seedbank? If so then this second generation of seeds represent a new instance of Organism (Organism2), and the same place but different time (Event3), and hence represent a new Occurrence3? We could also document the original (now grown) Organism1 at Event3 as representing Occurrence4.

Your step 3 seems to be a situation where Organism2 is now relocated again (other parties/seedbanks), at a new Location+Time (Events), and new Occurrences accordingly. Each new generation represents a new Organism, and each new documented instance of the Organism at Location+Time (Event) constitutes a new Occurrence.
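To make the bookkeeping above concrete, here is a minimal sketch of "Occurrence = Organism at an Event" in Python. The class names mirror the DwC classes, but the field names, the location strings, and the dates are placeholders assumed for illustration; they are not taken from any real dataset or from the DwC term list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    organism_id: str


@dataclass(frozen=True)
class Event:
    event_id: str
    location: str  # placeholder label, not a real locality
    date: str      # placeholder label, not a real date


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: str
    organism: Organism
    event: Event


# Organism1 = originally collected seeds; Organism2 = next generation of seeds.
organism1 = Organism("Organism1")
organism2 = Organism("Organism2")

event1 = Event("Event1", "wild/on-farm site", "t0")
event2 = Event("Event2", "field station", "t1")
event3 = Event("Event3", "field station", "t2")  # same place, later time

occurrences = [
    Occurrence("Occurrence1", organism1, event1),
    Occurrence("Occurrence2", organism1, event2),
    Occurrence("Occurrence3", organism2, event3),
    Occurrence("Occurrence4", organism1, event3),
]


def occurrences_of(organism, records):
    """Return the occurrence IDs in which the given Organism takes part."""
    return [o.occurrence_id for o in records if o.organism == organism]


print(occurrences_of(organism1, occurrences))
print(occurrences_of(organism2, occurrences))
```

In this reading, one Organism can take part in many Occurrences, and one Event can host Occurrences of several Organisms.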

It seems to me that the key piece of information you need to track is the pedigree of the Organisms. I'm sure Zoos with breeding programs and in-situ evolutionary/ecological studies need to track this same kind of information, and there are at least two ways to do that in DwC: either in a simple way via dwc:associatedOrganisms, or in a more structured way using dwc:ResourceRelationship, with a value of dwc:relationshipOfResource something like "mother of" (as given in the example here).
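As a rough sketch of the more structured option, pedigree can be recorded as ResourceRelationship-style rows and walked back to the source material. The keys resourceID, relationshipOfResource and relatedResourceID are DwC terms; the Organism identifiers, the single-parent assumption, and the `ancestors` helper are hypothetical and only illustrate the idea.

```python
# Each row reads: resourceID <relationshipOfResource> relatedResourceID.
relationships = [
    {"resourceID": "Organism1", "relationshipOfResource": "mother of",
     "relatedResourceID": "Organism2"},
    {"resourceID": "Organism2", "relationshipOfResource": "mother of",
     "relatedResourceID": "Organism3"},
]


def ancestors(organism_id, rels, relation="mother of"):
    """Walk the pedigree upward, nearest ancestor first.

    Assumes at most one recorded parent per organism for the given relation.
    """
    parent_of = {
        r["relatedResourceID"]: r["resourceID"]
        for r in rels
        if r["relationshipOfResource"] == relation
    }
    lineage = []
    seen = {organism_id}
    current = organism_id
    while current in parent_of:
        current = parent_of[current]
        if current in seen:  # guard against accidental cycles in the data
            break
        seen.add(current)
        lineage.append(current)
    return lineage


print(ancestors("Organism3", relationships))
```

A crossing between two breeding lines would need two parent rows per offspring, which this single-parent sketch does not handle.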

The other issue raised by your use case is the fuzzy boundary between dwc:Organism and dwc:MaterialSample. I still wrestle with that, and I honestly haven't figured out yet how to draw that line, other than "alive" vs. "dead". I'm not sure if that question is part of this Issue, or needs to be explored in a new Issue.

Nov 01 '20 18:11 deepreef

@deepreef Yes, several generations of plants (plant populations of similar-ish genetic variation). The plant material (seeds) in seedbanks are conserved and treated as the same Accession (same accession number, aka dwc:catalogNumber) through several generations. The Accession is grown ex situ for multiplication of seed stocks at new locations, with the goal to be maintained as genetically similar as possible.

I was until now thinking of the same Accession (aka genotype) through several generations, grown again at new locations, as still remaining the same Organism - as is the tradition at seedbanks. Considering each new intersect of the "Accession" as Organism with a new Event (ex situ) in a field station as a new Occurrence is what I did not quite dare to do, but I DO AGREE, it makes very good sense. [Did the definition of Occurrence change to no longer require that the organism occurrence is in its natural habitat in situ? Or did I just fool my mind with the assumption of such a limitation?]

I believe that seedbanks already normally keep track of the harvest year for all material that is distributed. So catalogNumber + harvest year makes identifying this new "Occurrence2" quite possible -- and if modeled so also the "Organism2" can thus be identified in practice. The thought of a new Organism for each harvest year is new to me (I might still remain inclined to model the material in step 2, new generations of plants at the same seedbank, as the same Organism - but I do think this model is possible to make based on the current metadata already documented in the seedbank databases). In my head it is step 3 which makes a new Organism.
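A minimal sketch of the catalogNumber + harvest year idea follows. The accession number, the years, the key format, and the `generation_key` helper are all made up for illustration; real seedbank databases may store these fields differently.

```python
def generation_key(catalog_number: str, harvest_year: int) -> str:
    """Combine an accession (catalog) number and a harvest year into one key."""
    return f"{catalog_number.strip()}:{harvest_year}"


# Hypothetical distribution records for a single accession.
shipments = [
    {"catalogNumber": "ACC-0001", "harvestYear": 2015},
    {"catalogNumber": "ACC-0001", "harvestYear": 2015},
    {"catalogNumber": "ACC-0001", "harvestYear": 2019},
]

# Distinct generations (candidate Organism/Occurrence identities) of the accession.
generations = sorted(
    {generation_key(s["catalogNumber"], s["harvestYear"]) for s in shipments}
)
print(generations)
```

Three shipments collapse to two generation keys here, which is the granularity at which a new Occurrence (and, if modeled so, a new Organism) could be identified.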

The pedigree is also maintained in seedbank databases but only per accession number (or often only per cultivar name) and most often not per harvest year (generation). With the rather recent new UN FAO ITPGRFA GLIS system assigning new DOIs for each seed transfer it is from now onwards absolutely possible to reconstruct the pedigree to the accession + harvest year.

This was definitely a new way to look at the model! Thanks a zillion!!!

Nov 01 '20 18:11 dagendresen

Thanks for the clarification & confirmation.

Also, material conserved as the same Accession (same accession number, aka dwc:catalogNumber) includes, as time passes (and seeds are distributed), seeds from several generations, but maintained as genetically similar as possible.

Yeah, there is wide variation in how things like Accession numbers and Catalog numbers are mapped to "things" in our world. I think this is part of why we have problems with dwc:basisOfRecord, and also why the so-called "DarwinCore Triplet" is often ineffective as a unique identifier. I've decided that in our system, catalog numbers and accession numbers get attached to instances of dwc:MaterialSample -- so however we resolve the relationship between dwc:MaterialSample and dwc:Organism will inform how things like Accession Numbers and Catalog Number relate to Occurrence instances.

Considering each new intersect of the Organism1 with a new Event (ex situ) in a field station as a new Occurrence2 is what I did not quite dare to do, but I DO AGREE, it makes very good sense. [Did the definition of Occurrence change to no longer require that the organism occurrence is in its natural habitat in situ? Or did I just fool my mind with the assumption of such a limitation?]

Yeah, I've always been unsure about this whole "Occurrence presumes natural habitat" thing. Since the early days of DwC, I believe it has been used for tracking things in Zoos and Botanical gardens and such. And there is a gradation from Native-->Naturalized-->Introduced by Humans-->Cultivated/in gardens-->Captive-->preserved in Museum -- which makes it unclear where "natural" ends and "artificial" begins (this is much more in the domain of dwc:establishmentMeans -- maybe the original title for this Issue by @baskaufs was a Freudian slip?)

But to address this question, I'll share another "Alice in Wonderland" (AIW) implementation/thought experiment we've explored -- i.e., using Occurrence instances to track the movement of Organisms (and potentially MaterialSample instances as well) both in nature and outside of nature. So here's another use case, again involving fish and video, partially based on real-world, and partially based on hypothetical.

Start with the real-world, which you can see here. There are two Events represented in this video, both at the same Location, but different times. The first was 11 October 2011 (start-02:48 in the YouTube video), which we'll call Event1. The second (02:48-end in the video) was 19 August 2012, nearly a year later (Event2). There are a bunch of Organisms in the videos, but at least two of them are shared between the two different Events (the black & white colored butterflyfish, which is an individual of Chaetodon tinkeri; and the dusky yellow fish with the dark smudge at the back, which is a hybrid between C. tinkeri and C. miliaris). We'll call them Organism1 and Organism2, respectively, but let's focus on the hybrid in particular (Organism2). The linked video gives us a pretty straightforward scenario, where the same Organism participates in two separate Occurrence instances, with the same Organism instance (Organism2) occurring at two separate events (Event1 & Event2). These two Occurrence instances (Occurrence1 & Occurrence2) are linked to each other by virtue of sharing the same dwc:organismID (what we used to link via dwc:individualID, but I see now that's been deprecated?). [As an aside, this video is another good example of humans as both recordedBy and Organisms participating in Occurrences... but I'll leave that alone.]

OK, so far, so good. Now I'm going to shift into hypothetical. Suppose the hybrid was later captured and put in an aquarium (it actually did disappear, so this may have actually happened). For the sake of argument, suppose these photos represent the same Organism2 from the video (they don't, but just pretend they do).

How do we track these aquarium photos? Most people would probably link the images to an Occurrence instance representing the Event where the fish was extracted from nature (i.e., the collecting event). But in our Evidence paradigm, the photos DO NOT serve as Evidence of that Occurrence. Rather, they serve as Evidence of Occurrence of the fish in an aquarium some time after the collecting Event, and at a completely different Location. So let's call Occurrence3 the intersection of this Organism2 (hybrid butterflyfish) and the Location+Time where it was collected and extracted from Nature (Event3), and we'll call the images Evidence of Occurrence4, which happened at a different Location and later time (Event4) than the collecting event (Event3).
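The linkage described above can be sketched as a small lookup, assuming (as in the thought experiment) that every item of Evidence points at exactly one Occurrence. The dictionary keys and the evidence labels are illustrative only and do not represent an agreed Evidence class.

```python
# The four Occurrences of the hybrid butterflyfish (Organism2).
occurrences = {
    "Occurrence1": {"organismID": "Organism2", "eventID": "Event1"},
    "Occurrence2": {"organismID": "Organism2", "eventID": "Event2"},
    "Occurrence3": {"organismID": "Organism2", "eventID": "Event3"},  # collecting event
    "Occurrence4": {"organismID": "Organism2", "eventID": "Event4"},  # in the aquarium
}

# Evidence items and the Occurrence each one documents.
evidence = [
    {"item": "video, first segment", "occurrenceID": "Occurrence1"},
    {"item": "video, second segment", "occurrenceID": "Occurrence2"},
    {"item": "aquarium photos", "occurrenceID": "Occurrence4"},
]


def evidence_for(occurrence_id, items):
    """List the evidence items that document one Occurrence."""
    return [e["item"] for e in items if e["occurrenceID"] == occurrence_id]


def without_evidence(occs, items):
    """Occurrences that no evidence item points at."""
    documented = {e["occurrenceID"] for e in items}
    return sorted(o for o in occs if o not in documented)


print(evidence_for("Occurrence4", evidence))
print(without_evidence(occurrences, evidence))
```

Under this reading, the collecting event (Occurrence3) is still a valid Occurrence of Organism2; it simply has no image Evidence attached to it.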

This is a little weird from the perspective of DwC tradition, because now we're introducing an Occurrence instance that is separate from the "organism extracted from nature" Occurrence, which is what most specimens are treated as. But still not too weird, because we still have the same organism (Organism2, the hybrid butterflyfish), and it's being documented across different Events (different Locations & Times).

But now comes the deeper AIW scenario. Let's say the same butterflyfish (Organism2) is euthanized and preserved in a Museum collection. Soon after it is dead, but before it is preserved in a Museum, the specimen is photographed. (Again, this is actually a different fish from the U/W video and the aquarium photos, but pretend it's the same fish for the sake of this use case). The reason this gets into AIW (rabbit hole) territory is this: Is the subject of this photograph the same Organism2 instance, or is it a new dwc:MaterialSample instance? Or is it both? At what point did the dwc:Organism become an instance of dwc:MaterialSample? And what are the implications of the images? Do we link the image of the dead specimen to the Organism2 instance? Or to the MaterialSample1 instance? And how do we link Organism2 to MaterialSample1? Also, the dead-fish images can serve as Evidence of Occurrence where+when the fish was photographed, but does that mean that an Occurrence can also be represented as an instance of Event+MaterialSample (as opposed to Event+Organism)?

I could easily go on for many more paragraphs about how this scenario can dive even deeper into the AIW rabbit hole (derivative tissue samples, resulting DNA sequences, etc.). But the key point here, which comes back to what you posted, is that we have considered (but not yet implemented) treating all movements of MaterialSample items (e.g., loans, movements to different storage areas, etc.) as Occurrence records. We've held back mostly because of this uncertainty about how dwc:Organism instances relate to dwc:MaterialSample instances. As I said, our working definition is "live" vs. "dead" ["and preserved"?].

I know @tucotuco is absolutely right -- that this conversation belongs in an ontology discussion space. But as I said before, I think this stuff is of fundamental importance to moving forward with DwC -- both in terms of defining terms and classes, and (perhaps more importantly) in terms of how people populate shared datasets, and what assumptions are made about how to interpret the data that are shared. It really needs to be discussed and documented somewhere, even if this is not the right place. But it is certainly relevant to dwc:basisOfRecord, so it's not completely out of scope here.

I have some more ideas about all this, but my Sunday agenda has other tasks on it that I must tend to.

Nov 01 '20 20:11 deepreef

In my humble opinion/understanding, the dead fish is still the SAME Organism2 as it was when it was alive. And the MaterialSample1 instance starts when the fish is included in a scientific collection (including at the point in time when it was put in the aquarium) -- and not at the point in time when it was euthanized. But I am also willing to think differently!

dwc:Organism A particular organism or defined group of organisms considered to be taxonomically homogeneous.

dwc:MaterialSample A physical result of a sampling (or subsampling) event. In biological collections, the material sample is typically collected, and either preserved or destructively processed.

dwc:Occurrence An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time.

At the museum in Oslo, we have agreed to use the same organismID to link a record for a PreservedSpecimen to and between one or multiple records for the MaterialSample(s) in the DNA tissue bank -- all of them dead things linked together by the same organismID.

Here I found an example of 6 tissue samples from the fish collection in Oslo linked to each other by the same organismID = urn:uuid:74ea1770-c89f-5e72-9cfd-3e1a14c1d27c ---- I notice that each MaterialSample has a different, individual occurrenceID, as is required by IPT, even though they most likely originate from the very SAME in situ Occurrence as the intersection of Organism and the Event when the Organism/MaterialSample(s) were sampled from nature. This particular swordfish is, by the way, indicated to be sampled at a popular beach in Oslo city (urban area), and from the photo, it looks to have been found already quite dead. [Was the swordfish thus already a MaterialSample and not an Occurrence even before it was discovered and sampled?? And thus no occurrenceID at all to be documented???].
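As a sketch of how a shared organismID ties such records together, the snippet below groups records by organismID. Only the organismID value is taken from the example above; the occurrenceID values, the basisOfRecord assignments, and the second organism are invented placeholders, not the actual Oslo records.

```python
from collections import defaultdict

SWORDFISH = "urn:uuid:74ea1770-c89f-5e72-9cfd-3e1a14c1d27c"

# Placeholder records: every record carries its own occurrenceID,
# while records of the same individual share one organismID.
records = [
    {"occurrenceID": "occ-A", "basisOfRecord": "PreservedSpecimen", "organismID": SWORDFISH},
    {"occurrenceID": "occ-B", "basisOfRecord": "MaterialSample", "organismID": SWORDFISH},
    {"occurrenceID": "occ-C", "basisOfRecord": "MaterialSample", "organismID": SWORDFISH},
    {"occurrenceID": "occ-D", "basisOfRecord": "MaterialSample", "organismID": "other-organism"},
]


def group_by_organism(rows):
    """Map each organismID to the occurrenceIDs of the records that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["organismID"]].append(row["occurrenceID"])
    return dict(groups)


groups = group_by_organism(records)
print(groups[SWORDFISH])
```

The grouping recovers "same individual" from the shared organismID, but it cannot say whether the distinct occurrenceIDs refer to one in situ Occurrence or several, which is exactly the ambiguity raised here.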

NHMO-DFH-782

Which demonstrates a core problem with basisOfRecord (?!). When the type of the record is basisOfRecord = MaterialSample, the class of the thing the record describes is MaterialSample and not the in situ Occurrence ??! If the basisOfRecord = MaterialSample, PreservedSpecimen, etc, should remain possible, then occurrenceID must be optional!! --- while occurrenceID is mandatory in IPT.

dwc:basisOfRecord The specific nature of the data record.

dwc:occurrenceID An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.

dwc:materialSampleID An identifier for the MaterialSample (as opposed to a particular digital record of the material sample). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the materialSampleID globally unique.

Accepting that the fish in the aquarium is also an Occurrence is absolutely brilliant!! [with respect to the analog implication for seedbank accessions]

Nov 02 '20 16:11 dagendresen