Tracking intermediate detection/classification steps
Thanks for the awesome work on this format.
I raised the following issue during the seminar of Nov 9th: in some data processing workflows, a first detection and/or classification result is generated by an automated system and then corrected by a human. It looks like you started to think about how to register that multiple steps were involved using a pipe string in #225. But what about keeping track of intermediate results? I think these are valuable for assessing the performance of the automated system that was used. From my shallow understanding of the schema, one option would be to duplicate observations with a different classifiedBy value for each duplicate, but that's probably a bad strategy as it would double the size of the data.
Or maybe you would just recommend publishing two files: one for the AI classification, one for the validation?
Hey @VLucet! Great question! I'll copy-paste my answer from #262:
> I see no problem with having multiple rows for the same media, as long as they can all be properly identified with classificationMethod & classifiedBy. This approach also gives you the possibility to have > 1 AI prediction in the mediaObservations table.
And then, with multiple observations produced by different classifiers (human- or machine-based), maybe we could indeed consider how to track the entire pipeline of intermediate results. The first idea that comes to mind is:
Add an extra field to the mediaObservations table, named e.g. sourceObservation, and self-reference it to observationID in the same table; see the Frictionless Data example of a self-referencing foreign key: https://specs.frictionlessdata.io/table-schema/#foreign-keys
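To sketch what I mean (treat sourceObservation and the trimmed-down field list as placeholders, not spec wording), a self-referencing foreign key in the mediaObservations Table Schema could look roughly like this, where the empty resource string indicates the reference points back to the same table:

```json
{
  "fields": [
    {"name": "observationID", "type": "string", "constraints": {"required": true, "unique": true}},
    {"name": "mediaID", "type": "string"},
    {"name": "sourceObservation", "type": "string"}
  ],
  "primaryKey": "observationID",
  "foreignKeys": [
    {
      "fields": "sourceObservation",
      "reference": {
        "resource": "",
        "fields": "observationID"
      }
    }
  ]
}
```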
What do you think? @VLucet @peterdesmet
@kbubnicki can you clarify this self-referencing sourceObservation approach? The two key issues I want solved for allowing multiple assessments are:
- Having a unique observationID
- Allowing multiple rows to be aggregated without over-counting the count values
Example
Assume an event (seq1) with 1 adult wild boar and 2 juveniles, visible in image1.jpg and image2.jpg, but completely out of frame in image3.jpg. Let's assume model A performs poorly, model B identifies everything, and human A only bothered to validate image1.jpg.
media.csv
mediaID | eventID | filePath
------- | ------- | --------
med1 | seq1 | image1.jpg
med2 | seq1 | image2.jpg
med3 | seq1 | image3.jpg
media-observations.csv
obsID | mediaID | classifiedBy | scientificName | count | lifeStage
------ | ------- | ------------ | -------------- | ----- | ---------
obs1.1 | med1 | model A | mammal | |
obs1.2 | med1 | model B | wild boar | 1 |
obs1.3 | med1 | model B | wild boar | 1 |
obs1.4 | med1 | model B | wild boar | 1 |
obs1.5 | med1 | human A | wild boar | 1 | adult
obs1.6 | med1 | human A | wild boar | 2 | juvenile
obs2.1 | med2 | model A | animal | |
obs2.2 | med2 | model B | wild boar | 1 |
obs2.3 | med2 | model B | wild boar | 1 |
obs2.4 | med2 | model B | wild boar | 1 |
obs3.1 | med3 | model A | blank | |
obs3.2 | med3 | model B | blank | |
How can I reliably get "med1 contained 3 wild boar"?
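For illustration, the only workaround I currently see is for data consumers to pick one classifiedBy per media file before summing, which is exactly the kind of convention I'd like the spec to make explicit. A quick pandas sketch (column names as in the example above, nothing here is spec wording):

```python
import pandas as pd

# Rebuild (part of) the media-observations.csv example from above.
obs = pd.DataFrame([
    ("obs1.1", "med1", "model A", "mammal",    None),
    ("obs1.2", "med1", "model B", "wild boar", 1),
    ("obs1.3", "med1", "model B", "wild boar", 1),
    ("obs1.4", "med1", "model B", "wild boar", 1),
    ("obs1.5", "med1", "human A", "wild boar", 1),
    ("obs1.6", "med1", "human A", "wild boar", 2),
], columns=["obsID", "mediaID", "classifiedBy", "scientificName", "count"])
obs["count"] = pd.to_numeric(obs["count"])  # model A's blank count becomes NaN

# Naive aggregation over-counts: model B and human A both saw the same 3 boars.
naive = obs.groupby(["mediaID", "scientificName"])["count"].sum()
print(naive)  # med1 / wild boar -> 6.0, which is wrong

# The right total only falls out after keeping a single classifier per media
# file, e.g. preferring the human validation where it exists.
validated = obs[obs["classifiedBy"] == "human A"]
print(validated.groupby(["mediaID", "scientificName"])["count"].sum())
# med1 / wild boar -> 3.0
```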
Addressed in Camtrap DP 0.6 #297.
Media observations:
> Media observations are not required to be mutually exclusive, i.e. multiple classifications (e.g. human vs machine) of the same observed individual(s) are allowed.

Event observations:
> Event observations are required to be mutually exclusive, so that count can be summed.
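In practice that means consumers can sum counts from event observations directly, while media observations must first be narrowed down to a single classification source (cf. the pandas sketch above). A minimal example with hypothetical rows:

```python
import pandas as pd

# Hypothetical event observations for seq1; rows are mutually exclusive,
# so the per-event total is simply the sum of count.
event_obs = pd.DataFrame([
    ("obs1", "seq1", "wild boar", 1, "adult"),
    ("obs2", "seq1", "wild boar", 2, "juvenile"),
], columns=["observationID", "eventID", "scientificName", "count", "lifeStage"])

print(event_obs.groupby(["eventID", "scientificName"])["count"].sum())
# seq1 / wild boar -> 3
```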