bio_data_guide

recommendation RE: eventID / occurrenceID creation

Open 7yl4r opened this issue 4 years ago • 7 comments

Should these always be UUIDs? Should data go into the ID? What are our recommendations, and where in this guide should we document them?

Jonathan Pye:

det_compressed['eventid'] = 'urn:catalog:otn:' + det_compressed['detectedby'].astype(str) + ':' + det_compressed['rcvrcatnumber'].astype(str)

Tim Van Der Stap:

project-cruise-station-cast#-sampleID (if needed)

7yl4r, Nov 10 '21 21:11

It would be really cool to figure this out and then write a function that can be called to reproducibly generate an ID: get_occurrence_id(occurrence_column)

7yl4r, Nov 10 '21 21:11
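A minimal sketch of what such a helper might look like, assuming pandas. Only the name get_occurrence_id comes from the comment above; the signature (a DataFrame plus a list of column names), the ':' separator and the two checks are hypothetical choices for illustration, not an agreed recommendation.

```python
import pandas as pd


def get_occurrence_id(df, columns, sep=":"):
    """Build one deterministic ID per row by joining the chosen columns.

    Hypothetical helper: the signature and separator are assumptions.
    """
    parts = df[columns]
    # Refuse to build IDs from incomplete rows rather than emit 'nan'.
    if parts.isna().any().any():
        raise ValueError("missing values in ID columns")
    ids = parts.astype(str).agg(sep.join, axis=1)
    # The chosen columns must identify each row uniquely.
    if ids.duplicated().any():
        raise ValueError("ID columns do not uniquely identify rows")
    return ids


# Illustrative rows only.
df = pd.DataFrame({
    "project": ["NSBS", "NSBS"],
    "organism": ["Sheena", "Sheena"],
    "sequence": [1, 2],
})
occurrence_ids = get_occurrence_id(df, ["project", "organism", "sequence"])
print(occurrence_ids.tolist())
```

Because the ID is a pure function of the row's values, rerunning it on the same data gives the same IDs, which is the reproducibility asked for above.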

13:16:21 From Jonathan Pye to Everyone: my eventid builder:

det_compressed['eventid'] = 'urn:catalog:otn:' + det_compressed['detectedby'].astype(str) + ':' + det_compressed['rcvrcatnumber'].astype(str)

so that'd come out to urn:catalog:otn:[project code]:[listening_station_code] for an event that is the deployment of a listening station

13:17:10 From Jonathan Pye to Everyone: rcvrcatnumber = otn:[projectcode]:[serialno]-[model]-[deploydate]

mobb, Nov 10 '21 22:11
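To try the quoted eventid builder outside its original pipeline, it can be run against a small stand-in DataFrame. This is only a sketch: the rows are made up, and since the thread does not show exactly what rcvrcatnumber holds at that step, placeholder station codes are assumed.

```python
import pandas as pd

# Made-up rows. 'STATION-A' etc. are placeholder codes, and 'XYZ' is a
# hypothetical project code.
det_compressed = pd.DataFrame({
    "detectedby": ["BDL", "BDL", "XYZ"],
    "rcvrcatnumber": ["STATION-A", "STATION-B", "STATION-A"],
})

# The builder as quoted: fixed prefix + project code + receiver/station code.
det_compressed["eventid"] = (
    "urn:catalog:otn:"
    + det_compressed["detectedby"].astype(str)
    + ":"
    + det_compressed["rcvrcatnumber"].astype(str)
)

# Worth checking after any ID build: is the result unique per event?
eventids_unique = det_compressed["eventid"].is_unique
print(det_compressed["eventid"].tolist(), eventids_unique)
```

Note that the same station code under two project codes still yields distinct IDs here, which is what including the project code buys.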

13:15:26 From Tim Van Der Stap to Everyone: for the current eventIDs I've created they've typically followed the format: project-cruise-station-cast#-sampleID (if needed)

mobb, Nov 10 '21 22:11

That's the goal. Get down to a combination of some set of entities that guarantees a unique identifier. An event that says 'this instrument was deployed here at this time, in association with this project.'

Even there, the project is superfluous if everyone's serial numbers are nicely unique. But the thing comes out to (otn says that):(the BDL project reported):(instrument 12049245)-(the Vemco VR2W)-(on this date)

otn:BDL:12049245-VR2W-2012-09-01

So there's my eventID for deploying that receiver. Now, having the detections in hand, I can work from the tagged animal information from another project to produce the correct associated eventID even if this tagging project didn't produce it.

For a detection at that receiver of a blue shark under NSBS, we can file an occurrence extension record with the eventID above. The occurrenceID reads as: (otn says that):(we saw an animal from the NSBS project):(it was named Sheena by the NSBS project of OTN, i.e. the appropriate organismID):(it was the first time we saw it, via an autoincrementer)

otn:NSBS:otn-NSBS-Sheena:1

jdpye, Nov 12 '21 16:11
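The occurrenceID pattern described above (prefix, project, organismID, autoincrementer) could be sketched in pandas roughly as follows. The column names and detection times are made up for illustration, and numbering by time order within each animal is an assumption about how the autoincrementer works.

```python
import pandas as pd

# Illustrative detections of one tagged animal, deliberately out of order.
detections = pd.DataFrame({
    "project": ["NSBS", "NSBS", "NSBS"],
    "organismID": ["otn-NSBS-Sheena"] * 3,
    "detected_at": pd.to_datetime(
        ["2012-09-03 04:10", "2012-09-02 11:30", "2012-09-05 19:45"]
    ),
})
# Every detection here is tied to the receiver deployment event above.
detections["eventID"] = "otn:BDL:12049245-VR2W-2012-09-01"

# Sort so the counter follows detection order, then number each
# animal's detections from 1 (the "autoincrementer").
detections = detections.sort_values("detected_at").reset_index(drop=True)
detections["seq"] = detections.groupby("organismID").cumcount() + 1

detections["occurrenceID"] = (
    "otn:" + detections["project"]
    + ":" + detections["organismID"]
    + ":" + detections["seq"].astype(str)
)
print(detections[["occurrenceID", "detected_at"]])
```

One caveat with a counter like this: it is only stable if the full, ordered set of detections is available every time the IDs are rebuilt. An earlier detection arriving late would shift the numbers.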

Related, what do you do if you are missing a bit of information for one record?

Using otn:NSBS:otn-NSBS-Sheena:1 as an example.

Do you:

  1. Leave it blank (e.g. otn::otn-NSBS-Sheena:1)
  2. Assign some fill value (e.g. otn:_:otn-NSBS-Sheena:1 or otn:nan:otn-NSBS-Sheena:1)
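To make the two options concrete, here is a small plain-Python sketch. The build_id function and its fill parameter are hypothetical; it also adds a third behaviour (refusing to build the ID at all) purely for comparison.

```python
def build_id(parts, sep=":", fill=None):
    """Join ID components, handling missing ones explicitly.

    Hypothetical helper. With fill=None a missing component raises;
    otherwise the fill value is substituted ("" gives option 1).
    """
    out = []
    for part in parts:
        # None, empty string, or NaN (the only value where x != x).
        missing = part is None or part == "" or part != part
        if missing:
            if fill is None:
                raise ValueError("missing ID component")
            part = fill
        out.append(str(part))
    return sep.join(out)


parts = ["otn", None, "otn-NSBS-Sheena", 1]
blank_id = build_id(parts, fill="")    # option 1: leave it blank
filled_id = build_id(parts, fill="_")  # option 2: assign a fill value
print(blank_id)
print(filled_id)
```

Whichever fill value is used, it should be one that can never occur as a real component, otherwise a filled ID and a genuine one could collide.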

For AMBON zooplankton I stuck in ? as a temporary placeholder to help when reviewing the IDs (I know ? is a terrible choice, hence the question above). Here's an example:

AMBON_Zooplankton_2017_BBL1_?_Pisces_larvae_?_TWINRING_150UM_MICROSCOPY_nan_12_2017-08-20T22:48:00

That ID is generated as follows:

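# Join the ID components with underscores.
# Every column listed here must already hold strings, or '_'.join will fail.
df['occurrenceID'] = df[['datasetID',
                         'Station',
                         'Cast_Number_conv',
                         'Accepted_Organism_Identification_conv',
                         'Life_Stage_conv',
                         'sex',
                         'Type',
                         'biomass_str',
                         'depth_str',
                         'date_str',
                         ]].agg('_'.join, axis=1)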

MathewBiddle, Nov 12 '21 16:11
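A cut-down, runnable variant of the join above may help show where placeholders like ? and nan come from. The sketch uses made-up values and only four of the columns. The explicit string conversion and the placeholder substitution are assumptions about what the *_conv and *_str columns were doing upstream.

```python
import pandas as pd

# One made-up row with a missing cast number and a numeric depth.
df = pd.DataFrame({
    "datasetID": ["AMBON_Zooplankton_2017"],
    "Station": ["BBL1"],
    "Cast_Number_conv": [float("nan")],
    "depth_str": [12],
})
id_columns = ["datasetID", "Station", "Cast_Number_conv", "depth_str"]

values = df[id_columns]
# '_'.join needs strings. astype(str) alone would turn the missing cast
# number into the literal text 'nan', so mark missing cells explicitly.
parts = values.astype(str).mask(values.isna(), "?")

df["occurrenceID"] = parts.agg("_".join, axis=1)
print(df["occurrenceID"].iloc[0])
```

Doing the substitution in one place makes the placeholder a single, visible decision rather than something that leaks in through per-column conversions.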

I don't have a good analogue, I'm afraid. I haven't got any cases where I don't know the animal ID (assigned by me when the researcher doesn't specify one) or the project code (also assigned by me).

I guess I am picking values that are unique both logically and physically, but also ones that are not solely researcher-derived. If I don't have a deploy date, I chase the researcher for it: I can't publish or share their data until I know it's from a real deployment in the field and not, for example, a detection on a lab bench somewhere.

jdpye, Nov 12 '21 17:11

In your example, if they're differentiating on life stage and sex for the observation, it makes sense to have those in a field: 'this is the time we were looking for X life stage and X sex for X species'. In Occurrence core I guess the ID has to carry all this weight. In Event core, though, the parent event could do a lot to inform the occurrenceID, which then wouldn't need many referential components, since its spatial, temporal and maybe logistical dimensions (like replicate) are already handled by its parent Event.

jdpye, Nov 12 '21 17:11
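As a rough illustration of the Event core idea above, the occurrenceID can be reduced to the parent eventID plus a short per-event counter, with station, date and depth living on the Event record. Everything here is hypothetical: the eventID follows the project-cruise-station-cast pattern quoted earlier with invented values, and the '-occ' suffix is just one possible convention.

```python
import pandas as pd

# Hypothetical parent event carrying the spatial/temporal context.
events = pd.DataFrame({
    "eventID": ["PROJ-CR01-ST05-cast1"],
    "eventDate": ["2017-08-20T22:48:00"],
    "minimumDepthInMeters": [12],
})

# Made-up child occurrences; only what was observed is recorded here.
occurrences = pd.DataFrame({
    "eventID": ["PROJ-CR01-ST05-cast1", "PROJ-CR01-ST05-cast1"],
    "scientificName": ["Pisces", "Pisces"],
    "lifeStage": ["larvae", "egg"],
})

# occurrenceID = parent eventID + a counter within that event.
counter = occurrences.groupby("eventID").cumcount() + 1
occurrences["occurrenceID"] = (
    occurrences["eventID"] + "-occ" + counter.astype(str)
)

# Date and depth are recovered from the parent rather than the ID.
merged = occurrences.merge(
    events, on="eventID", how="left", validate="many_to_one"
)
print(merged[["occurrenceID", "lifeStage", "eventDate"]])
```

Life stage and sex stay in their own fields, as suggested above, so the ID no longer has to encode them.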