Tracking intermediate detection/classification steps
Thanks for the awesome work on this format.
I raised the following issue during the seminar of Nov 9th: in some data processing workflows, a first detection and/or classification result is generated by an automated system and then corrected by a human. It looks like you started to think about how to register that multiple steps were involved using a pipe string in #225. But what about keeping track of intermediate results? I think these are valuable for assessing the performance of the automated system that was used. From my shallow understanding of the schema, one option would be to duplicate observations with a different classifiedBy value for each duplicate, but that's probably a bad strategy as it would double the size of the data.
Or maybe you would just recommend publishing two files: one for the AI classification, one for the validation?
Hey @VLucet! Great question! I'll copy-paste my answer from #262:
> I see no problem with having multiple rows for the same media, as long as they can all be properly identified with classificationMethod & classifiedBy. This approach also gives you the possibility to have > 1 AI prediction in the mediaObservations table.
And then, with multiple observations produced by different classifiers (human- or machine-based), maybe we could indeed consider how to track the entire pipeline of intermediate results. The first idea that comes to mind is:
Add an extra field to the mediaObservations table, named e.g. sourceObservation, and self-reference it to observationID in the same table; see the Frictionless Data example of a self-referencing foreign key: https://specs.frictionlessdata.io/table-schema/#foreign-keys
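To sketch what I mean (treat sourceObservation and the trimmed-down field list as placeholders, not spec wording), a self-referencing foreign key in the mediaObservations Table Schema could look roughly like this, where the empty resource string indicates the reference points back to the same table:

```json
{
  "fields": [
    {"name": "observationID", "type": "string", "constraints": {"required": true, "unique": true}},
    {"name": "mediaID", "type": "string"},
    {"name": "sourceObservation", "type": "string"}
  ],
  "primaryKey": "observationID",
  "foreignKeys": [
    {
      "fields": "sourceObservation",
      "reference": {
        "resource": "",
        "fields": "observationID"
      }
    }
  ]
}
```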
What do you think? @VLucet @peterdesmet
@kbubnicki can you clarify this self-referencing sourceObservation approach? The two key issues I want solved for allowing multiple assessments are:
- Having a unique observationID
- Allowing multiple rows to be aggregated without over-counting the count values
Example
Assume an event (seq1) with 1 adult wild boar and 2 juveniles, visible in image1.jpg and image2.jpg, but completely out of frame in image3.jpg. Let's assume model A performs poorly, model B identifies everything, and human A only bothered to validate image1.jpg.
media.csv
mediaID | eventID | filePath
------- | ------- | --------
med1 | seq1 | image1.jpg
med2 | seq1 | image2.jpg
med3 | seq1 | image3.jpg
media-observations.csv
obsID | mediaID | classifiedBy | scientificName | count | lifeStage
------ | ------- | ------------ | -------------- | ----- | ---------
obs1.1 | med1 | model A | mammal | |
obs1.2 | med1 | model B | wild boar | 1 |
obs1.3 | med1 | model B | wild boar | 1 |
obs1.4 | med1 | model B | wild boar | 1 |
obs1.5 | med1 | human A | wild boar | 1 | adult
obs1.6 | med1 | human A | wild boar | 2 | juvenile
obs2.1 | med2 | model A | animal | |
obs2.2 | med2 | model B | wild boar | 1 |
obs2.3 | med2 | model B | wild boar | 1 |
obs2.4 | med2 | model B | wild boar | 1 |
obs3.1 | med3 | model A | blank | |
obs3.2 | med3 | model B | blank | |
How can I reliably get "med1 contained 3 wild boar"?
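For illustration, the only workaround I currently see is for data consumers to pick one classifiedBy per media file before summing, which is exactly the kind of convention I'd like the spec to make explicit. A quick pandas sketch (column names as in the example above, nothing here is spec wording):

```python
import pandas as pd

# Rebuild (part of) the media-observations.csv example from above.
obs = pd.DataFrame([
    ("obs1.1", "med1", "model A", "mammal",    None),
    ("obs1.2", "med1", "model B", "wild boar", 1),
    ("obs1.3", "med1", "model B", "wild boar", 1),
    ("obs1.4", "med1", "model B", "wild boar", 1),
    ("obs1.5", "med1", "human A", "wild boar", 1),
    ("obs1.6", "med1", "human A", "wild boar", 2),
], columns=["obsID", "mediaID", "classifiedBy", "scientificName", "count"])
obs["count"] = pd.to_numeric(obs["count"])  # model A's blank count becomes NaN

# Naive aggregation over-counts: model B and human A both saw the same 3 boars.
naive = obs.groupby(["mediaID", "scientificName"])["count"].sum()
print(naive)  # med1 / wild boar -> 6.0, which is wrong

# The right total only falls out after keeping a single classifier per media
# file, e.g. preferring the human validation where it exists.
validated = obs[obs["classifiedBy"] == "human A"]
print(validated.groupby(["mediaID", "scientificName"])["count"].sum())
# med1 / wild boar -> 3.0
```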
Addressed in Camtrap DP 0.6 #297.
Media observations:
> Media observations are not required to be mutually exclusive, i.e. multiple classifications (e.g. human vs machine) of the same observed individual(s) are allowed.

Event observations:
> Event observations are required to be mutually exclusive, so that count can be summed.
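In practice that means consumers can sum counts from event observations directly, while media observations must first be narrowed down to a single classification source (cf. the pandas sketch above). A minimal example with hypothetical rows:

```python
import pandas as pd

# Hypothetical event observations for seq1; rows are mutually exclusive,
# so the per-event total is simply the sum of count.
event_obs = pd.DataFrame([
    ("obs1", "seq1", "wild boar", 1, "adult"),
    ("obs2", "seq1", "wild boar", 2, "juvenile"),
], columns=["observationID", "eventID", "scientificName", "count", "lifeStage"])

print(event_obs.groupby(["eventID", "scientificName"])["count"].sum())
# seq1 / wild boar -> 3
```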