gbif-api

What and how to return Occurrence eventDate?

Open cgendreau opened this issue 8 years ago • 24 comments

The goal of this issue is to document and collect feedback about how the API should, in future, return the date on which an occurrence was recorded. It concerns only the API response.

Supported date ISO formats

The following ISO date formats, based on the granularity provided by the data publisher, will be returned by the API:

Pattern Example value
yyyy 2016
yyyy-mm 2016-03
yyyy-mm-dd 2016-03-05
yyyy-mm-dd'T'hh:mm:ss 2016-03-05T13:03:07

Date intervals using the formats above separated with a slash (/): 2016-03-05/2016-03-06.
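As a rough illustration of the formats listed above, the following Python sketch classifies a value by granularity. The regular expressions and the function names `granularity` and `describe` are assumptions made for this example; they are not part of the GBIF API.

```python
import re

# Patterns for the four formats listed above. These are illustrative
# assumptions, not the API's own validation rules.
PATTERNS = [
    ("year", re.compile(r"\d{4}")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("second", re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")),
]


def granularity(value):
    """Return the granularity name of a single date, or None if unrecognised."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return name
    return None


def describe(event_date):
    """Return one granularity for a single date, two for an interval."""
    return [granularity(part) for part in event_date.split("/")]
```

With this sketch, `describe("2016-03-05/2016-03-06")` returns `["day", "day"]`, and a value that matches none of the patterns yields `None` for that part.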

How should it be returned by the API?

Option 1 - Update/reuse the current eventDate

This option consists of reusing the existing JSON field eventDate to return all supported ISO date formats. After this change, eventDate could contain values like:

"eventDate":"2012-03/2012-04"
"eventDate":"2012-03-01/2012-04-02"
"eventDate":"2012-03-01"
"eventDate":"2012-03"
"eventDate":"2016-03-05T13:03:07"

Option 2 - introduce new field(s)

This option consists of adding new field(s) to the JSON response. For example:

"isoEventDate":"2012-03/2012-04"

The current eventDate could be one of the following:

  • maintained but always empty
  • maintained and provided only for non date range
  • maintained and always filled (for date ranges it would contain the start of the range)

What to do with year, month, day

This is an open question for both options.

The different options are:

  • remove them to avoid confusion
  • maintain them but provide values only for non date range
  • maintain them and always return a value (for date ranges it would contain the start of the range)

cgendreau avatar Sep 15 '16 14:09 cgendreau

+1 for Option1

fmendezh avatar Sep 15 '16 14:09 fmendezh

  • I prefer Option1
  • I vote for removing the year/month/day fields; I imagine it reduces confusion and makes the JSON response slightly smaller

sckott avatar Sep 15 '16 18:09 sckott

Thanks as always Scott

Can you please comment on your perception of how this will impact existing tools built on top of rGBIF?

It's manageable to coordinate with you so that rGBIF updates simultaneously, but anyone who has deployed a workflow using rGBIF who does not update the rGBIF dependency may suffer. In doing the update they may also have to make some edits.

This is the first significant breaking change we're pondering, and it's difficult to judge if it would have a large impact or be a welcomed enhancement.

Any thoughts?


timrobertson100 avatar Sep 15 '16 19:09 timrobertson100

+1 for Option1 and removal of ymd

mdoering avatar Sep 15 '16 20:09 mdoering

Can you please comment on your perception of how this will impact existing tools built on top of rGBIF?

I'd imagine the "tools" built with rgbif are likely to be mostly analysis scripts from researchers using GBIF data, but I don't know for sure since it is open source after all. There are some web apps out there built with http://shiny.rstudio.com/ that may be affected. I can put some feelers out to researchers that use the R client and ask if proposed changes will affect them.

It's manageable to coordinate with you so that rGBIF updates simultaneously, but anyone who has deployed a workflow using rGBIF who does not update the rGBIF dependency may suffer. In doing the update they may also have to make some edits.

Right, and we can't force users to update. I can add warning messages for users during the transition and add notes to the docs to make it more likely users know about the changes.

The Python client can be updated at the same time - it's been around less time though so probably not as big of a deal

sckott avatar Sep 16 '16 05:09 sckott

Hi! I vote for option 1 and for keeping the day, month and year fields. Apart from the compatibility issues mentioned by @sckott, these fields make it possible to index by time on the client side when working with a lot of data.

molgor avatar Sep 16 '16 13:09 molgor

Thanks @molgor When the eventDate is a date range, what would you expect in year, month, day fields?

cgendreau avatar Sep 16 '16 13:09 cgendreau

Just to add, date ranges are commonly used in sampling event data.

Knowing the total sampling time allows the quantity of a species recorded in a sample to be calculated per hour, per day, etc.

Here are a few examples of sampling event datasets that use traps operating for many hours or days and thus record the date of the sampling event as a range:

kbraak avatar Sep 16 '16 13:09 kbraak

Hi @cgendreau! I'd like the corresponding substring of the eventDate field, with zero-padded values (e.g. 08) and a full year (yyyy). A user interested in seasonal presence could then filter on month or day ranges directly, without first parsing the eventDate field.

molgor avatar Sep 16 '16 14:09 molgor

Sorry @molgor I meant when eventDate contains a date range like "2012-03-01/2012-04-02". Would you expect year, month, day to be empty or to contain something like 2012, 3 and 1?

cgendreau avatar Sep 16 '16 14:09 cgendreau

Sorry @cgendreau, exactly, keep the minimum date.

molgor avatar Sep 16 '16 14:09 molgor

@kbraak we also already have around 31M records using a date pattern as yyyy-mm-dd/yyyy-mm-dd

cgendreau avatar Sep 16 '16 14:09 cgendreau

option 1 looks good to me. It might be worth considering adding a field that describes how the date is formatted in eventDate to not have to guess which of the possible combinations is used.

fmichonneau avatar Sep 16 '16 14:09 fmichonneau

@fmichonneau but I think it would follow ISO spec, and we can I think simply look for a / to see if it's a range, and if so, split on that /
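A minimal sketch of the "look for a / and split" idea, assuming the value is either a single ISO date or two ISO dates separated by one slash. The function name `split_event_date` is hypothetical.

```python
def split_event_date(event_date):
    """Return (start, end); end is None when the value is not a range."""
    start, separator, end = event_date.partition("/")
    return (start, end) if separator else (start, None)
```

A client could then branch on whether the second element is `None` instead of guessing the format.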

sckott avatar Sep 16 '16 16:09 sckott

I vote option 1 and no separate year, month, day fields. I am guessing that most people/applications using the api will be able to parse out that information and redundancy is confusing. However, I am a new user of the API, so take my opinion for what it is worth (possibly little).

zachary-foster avatar Sep 16 '16 17:09 zachary-foster

My response assumes that the original data are always still available in the download if you want them.

Support valid ISO 8601(E) to the fullest extent possible for eventDate, or work to change the definition of Darwin Core's eventDate. Supporting the term as it is defined should obviate the need for the documentation suggested by @fmichonneau.

Construct this eventDate as completely and as unambiguously as possible from the full complement of event data provided by the data publisher.

Construct the year, month, day as completely and as unambiguously as possible from the full complement of event data provided by the data publisher. This means that if there is a date range spanning days, day is left blank. If there is a date range spanning months, month is left blank. If there is a date range spanning years, year is left blank.

Construct the startDayOfYear and endDayOfYear as completely and unambiguously as possible from the constructed eventDate. This will facilitate the seasonal interests expressed by @molgor. If date resolution is not specific to the day for the start date, startDayOfYear is left blank. If date resolution is not specific to the day for the end date, endDayOfYear is left blank.

By making all of these constructed data consistent, and having access to the original, it should not introduce any of the confusion anticipated by @zachary-foster.

Some Python (sorry Tim) code to demonstrate this approach can be found at:

https://github.com/VertNet/post-harvest-processor/blob/master/lib/vn_utils.py#L368
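Independently of the linked VertNet code, the day-of-year rule described in this comment can be sketched as follows. This is an illustration only: it assumes full-form ISO values, and the function names are made up for the example.

```python
from datetime import date


def day_of_year(value):
    """Return the day of year for a day-resolution value, else None."""
    # Keep only the date part of a date-time such as 2019-04-06T20:00:00.
    date_part = value.split("T")[0]
    if len(date_part) != 10:  # yyyy or yyyy-mm: not specific to the day
        return None
    return date.fromisoformat(date_part).timetuple().tm_yday


def start_end_day_of_year(event_date):
    """Return (startDayOfYear, endDayOfYear) for a date or a date range."""
    start, separator, end = event_date.partition("/")
    if not separator:
        end = start
    return day_of_year(start), day_of_year(end)
```

Each side of the range is handled separately, so a start date without day resolution leaves only startDayOfYear blank.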

tucotuco avatar Sep 17 '16 09:09 tucotuco

I prefer the simplest possible solution, avoiding confusion and making long-term maintenance easier, therefore I vote for solution 1, and remove the separate fields. While I agree it's possible to deduce other fields completely and unambiguously, I think it's unnecessary: If all the data fit (unambiguously) in 1 standardized field, why make things more complex?

Other points:

  • Using the ISO standard will make it easy for consumers (parsers available for every language around!)
  • There are probably still "not too many" API consumers in the wild, so there is still time to make breaking changes, especially if they remove ambiguity and make the whole system cleaner and easier to maintain.
  • Can we use API versioning (there's currently a V1 in the URL) for such changes, keeping the previous version online for a few months? That could at the same time a) let users adapt at their pace b) make GBIF less shy in breaking the API, when needed.
  • Of course, all that assumes that the ISO standard works for all cases/existing records.

Hope this helps!

niconoe avatar Sep 20 '16 08:09 niconoe

BTW, big :thumbsup: to the team for discussing such issues publicly on GitHub!

niconoe avatar Sep 20 '16 08:09 niconoe

  1. I would opt for option 1, and removing YMD fields, but I'm also happy to keep the YMD as @tucotuco suggests. It also helps quick lookups.
  2. Regarding breaking changes: how is the info currently formatted in eventDate? I assume it does not currently contain ranges and that might be the biggest change?
  3. For ranges with identical start/end date (2012-03-04/2012-03-04), please return a single date: 2012-03-04
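The request in point 3 could be sketched like this. The helper name `collapse_identical_range` is hypothetical, and the comparison is purely textual, so it only catches ranges whose two sides are written identically.

```python
def collapse_identical_range(event_date):
    """Return a single date when both sides of a range are the same string."""
    start, separator, end = event_date.partition("/")
    if separator and start == end:
        return start
    return event_date
```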

peterdesmet avatar Sep 23 '16 07:09 peterdesmet

Apropos: How do we distinguish between eventDate duration and uncertainty?

How do we distinguish whether an eventDate such as "1981-01-01/1982-12-01" describes an uncertainty in accuracy of two years, or an event with an actual duration of two years?

With old museum specimen data the date or even exact year might not be known.

dagendresen avatar Jan 04 '17 13:01 dagendresen

@dagendresen There could be a combination of the two: an event that lasts a long time but has uncertain start/end dates.

One solution could be to have another field called "endDate" that is only populated for events with a duration. Then each could be unambiguously uncertain.

Another could be to have different separators for the two (e.g. "--" for duration and "/" for uncertainty).

zachary-foster avatar Jan 05 '17 22:01 zachary-foster

Thanks @zachary-foster EarliestDateCollected and LatestDateCollected were declared as Darwin Core terms between 2007-04-17 and 2009-04-24, when replaced by eventDate using slash (/) for range. http://rs.tdwg.org/dwc/terms/history/index.htm#EarliestDateCollected-2007-04-17 http://rs.tdwg.org/dwc/terms/history/index.htm#LatestDateCollected-2007-04-17 http://rs.tdwg.org/dwc/terms/history/index.htm#eventDate-2009-04-24

dagendresen avatar Jan 06 '17 07:01 dagendresen

Regarding the question of what to do with year/month/day, perhaps they are redundant for most usages, but I can imagine at least one practical usage of them alone, which I can't figure out how to reproduce using a date or daterange field.

I want to filter my API searches by a certain period of the year, at any year. For example, I could be interested in detecting Spanish winter plant occurrences, so I would filter using "month": from december to february

http://api.gbif.org/v1/occurrence/search?TAXON_KEY=6&COUNTRY=ES&BASIS_OF_RECORD=PRESERVEDSPECIMEN&BASIS_OF_RECORD=PRESERVED_SPECIMEN&MONTH=12%2C*&MONTH=*%2C2

That way I could get data from that period of the year, no matter whether the plants were collected last year or two centuries ago. Perhaps this can be done in some other way using dates or date ranges, but I don't know how to do it in a single API request (unless event searchable fields support using wildcard characters for year/day digits).
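For readers who want to build such a request programmatically, here is a small sketch using only the Python standard library. The parameter names and the two MONTH range values are copied from the URL in this comment; whether the API interprets them as intended is not verified here, and `winter_search_url` is a made-up helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.gbif.org/v1/occurrence/search"


def winter_search_url(taxon_key, country):
    """Build a search URL with the two month ranges used in the comment."""
    params = [
        ("TAXON_KEY", taxon_key),
        ("COUNTRY", country),
        ("BASIS_OF_RECORD", "PRESERVED_SPECIMEN"),
        ("MONTH", "12,*"),  # December onwards, as written in the comment
        ("MONTH", "*,2"),   # up to February, as written in the comment
    ]
    # urlencode percent-encodes the comma and the asterisk.
    return BASE_URL + "?" + urlencode(params)
```

Calling `winter_search_url(6, "ES")` reproduces the query from the comment, apart from the percent-encoding of `*`.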

dgasl avatar Mar 05 '17 18:03 dgasl

As https://github.com/gbif/portal-feedback/issues/652#issuecomment-343407595 suggested to continue discussion on date ranges here: Could the current interpretation of date ranges to single dates be flagged with DATE_RANGE_IGNORED?

peterdesmet avatar Feb 11 '20 09:02 peterdesmet

New proposal

(Also for discussion on the mailing list and community forum.)

A longstanding issue with the GBIF API is the interpretation and formatting of the Darwin Core term "eventDate".

Summary: instead of GBIF changing published eventDate values like 2009-03-18/2009-04-13 and 2010 to 2009-03-18 and 2010-01-01 respectively, we propose returning the values 2009-03-18/2009-04-13 and 2010 in the occurrence API and in downloads. Existing code/scripts that use the eventDate value may need to be updated.

The recommended best practise for the term is "use a date that conforms to ISO 8601-1:2019" (see https://dwc.tdwg.org/terms/#dwc:eventDate).

ISO 8601-1:2019 supports date ranges, and some publishers provide these. Examples are 2000-05, or 2007-11-13/2007-11-15. GBIF's current interpretation changes date ranges like this to the first possible day in the range (2000-05-01 and 2007-11-13).

At least 64 million occurrences are affected.

Change to date interpretation

We propose changing the eventDate field in the GBIF API to support ISO 8601-1 date ranges. A range will be returned where one was provided by the publisher, either directly as a range in the eventDate field, or through a combination of the year, month, day, startDayOfYear and endDayOfYear fields.

The data quality checks on dates will be improved to check for consistency between these fields: eventDate, year, month, day, startDayOfYear and endDayOfYear. These fields will only be populated if they are constant for the whole range of dates — a range spanning several days in January 2020 will have year=2020, month=January and day=(Blank).

startDayOfYear and endDayOfYear will also be present if the range is accurate to days.

Examples:

| Published eventDate | Interpreted eventDate | year | month | day | startDayOfYear | endDayOfYear |
|---|---|---|---|---|---|---|
| 2023-01-13 | 2023-01-13 | 2023 | 1 | 13 | 13 | 13 |
| 2023-01 | 2023-01 | 2023 | 1 | | | |
| 2023 | 2023 | 2023 | | | | |
| 2023-01-13/2023-01-14 | 2023-01-13/2023-01-14 | 2023 | 1 | | 13 | 14 |
| 2023-01-13/14 | 2023-01-13/2023-01-14 | 2023 | 1 | | 13 | 14 |
| 2023-01/2023-02 | 2023-01/2023-02 | 2023 | | | | |
| 2023-01/02 | 2023-01/2023-02 | 2023 | | | | |
| 2023/2024 | 2023/2024 | | | | | |
| 2023-01-01/2023-12-31 | 2023-01-01/2023-12-31 | 2023 | | | 1 | 365 |

Other cases where we can unambiguously determine a date or date range will also be handled, for example a record with a year and month but no eventDate, or non-ISO dates like January 2023.
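The "only populated if constant for the whole range" rule for year, month and day can be sketched as below. This is an illustration, not GBIF's implementation: it assumes the interpreted, full-form value (2023-01-13/2023-01-14, not 2023-01-13/14), and `constant_ymd` is a hypothetical name.

```python
def constant_ymd(event_date):
    """Return (year, month, day), each None unless constant over the range."""
    start, separator, end = event_date.partition("/")
    if not separator:
        end = start

    def parts(value):
        # Drop any time component, then pad missing components with None.
        fields = [int(p) for p in value.split("T")[0].split("-")]
        return fields + [None] * (3 - len(fields))

    result = []
    still_constant = True
    for a, b in zip(parts(start), parts(end)):
        # Once a coarser component differs, finer ones are blank as well.
        still_constant = still_constant and a is not None and a == b
        result.append(a if still_constant else None)
    return tuple(result)
```

Chaining the check from year to month to day means a range such as 2022-01-05/2023-01-05 yields no month or day, even though those two components happen to be equal on both sides.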

API example:

This record (portal link) is published with eventDate=2009-03-18/2009-04-13, year=2009, month=3, day=18. We currently change the eventDate:

"year": 2009,
"month": 3,
"day": 18,
"eventDate": "2009-03-18T00:00:00",

With this proposal, we would preserve the eventDate but remove day, as the event spans several days:

"year": 2009,
"month": 3,
"eventDate": "2009-03-18/2009-04-13",

This record (portal link) is published with eventDate=2019-04-06T20:00:00/2019-04-10T05:00:00 and no separate day, month or year values. Currently, we process it to this:

"year": 2019,
"month": 4,
"day": 6,
"eventDate": "2019-04-06T20:00:00",

Instead, we propose returning this:

"year": 2019,
"month": 4,
"eventDate": "2019-04-06T20:00:00/2019-04-10T05:00:00",
"startDayOfYear": 96,
"endDayOfYear": 100,

Searching

The search and download APIs will be affected by this change.

Occurrences will be returned if the occurrence date/date range is completely within the query date or date range.

Search: eventDate=2023-01-11
Record: eventDate=2023-01-11    -- included
Record: eventDate=2023-01       -- EXCLUDED
Record: eventDate=2023-01-11/12 -- EXCLUDED

Search: eventDate=2023-01-11,2023-01-12
Record: eventDate=2023-01-11    -- included
Record: eventDate=2023-01       -- EXCLUDED
Record: eventDate=2023-01-11/12 -- included

Search: eventDate=*,2023-01 (meaning "Before end of January 2023")
Record: eventDate=2023-01-11    -- included
Record: eventDate=2023-01       -- included
Record: eventDate=2023-01-11/12 -- included

Search: eventDate=2023-01,2023-01 (meaning "After start of January 2023 AND before end of January 2023")
Search: eventDate=2023-01 (same meaning)
Record: eventDate=2023-01-11    -- included
Record: eventDate=2023-01       -- included
Record: eventDate=2023-01-11/12 -- included

This implementation will avoid returning occurrences with eventDates like "2010/2021" in many queries. (There are millions of occurrences with large ranges like this.)
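The containment rule in this section can be sketched in Python as follows. This is an illustration under several assumptions: values are compared at day resolution only (no time components), record ranges are written in full form (2023-01-11/2023-01-12 rather than 2023-01-11/12), and the function names are invented for the example.

```python
import calendar
from datetime import date


def bounds(value):
    """Return (first_day, last_day) covered by yyyy, yyyy-mm or yyyy-mm-dd."""
    fields = [int(p) for p in value.split("-")]
    year = fields[0]
    if len(fields) == 1:
        return date(year, 1, 1), date(year, 12, 31)
    month = fields[1]
    if len(fields) == 2:
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    day = date(year, month, fields[2])
    return day, day


def interval(text, separator):
    """Parse 'a<sep>b' (or a single value) into (lower, upper) dates."""
    start, found, end = text.partition(separator)
    if not found:
        end = start
    lower = date.min if start == "*" else bounds(start)[0]
    upper = date.max if end == "*" else bounds(end)[1]
    return lower, upper


def matches(query, record):
    """True if the record's dates lie completely within the query's dates."""
    query_lower, query_upper = interval(query, ",")
    record_lower, record_upper = interval(record, "/")
    return query_lower <= record_lower and record_upper <= query_upper
```

Under these assumptions, `matches("2023-01", "2010/2021")` is false, which is the behaviour the paragraph above describes for records with very wide ranges.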

Density maps

There is a year filter for the density/pixel maps. An occurrence from 2023-01 will be included, but an occurrence with an eventDate spanning more than a single year (like 2022-12-31/2023-01-01) will no longer be included.

Quarterly analytics, global/regional trends

The quarterly analytics include calculations based on the individual dwc:year, dwc:month and dwc:day fields. The statistics will be affected where these values change or become blank.

rGBIF, PyGBIF

Both libraries will be updated as necessary to support eventDate values containing a date range.

Feedback

We have delayed addressing this issue for a long time, primarily due to concerns about changing the existing behaviour of the API. However, it's also one of the most frequently requested improvements to GBIF's interpretation.

If you are aware of software or systems which would have problems adapting to the proposed change, please let us know, either here, on the API users mailing list, the community forum or by email to me.

We will alert users in the same places when the change is ready to be tested on the test system at api.gbif-uat.org, where it will remain available for testing for at least 2 weeks. We will also inform users when the change is to be made live on api.gbif.org.

Thank you,

MattBlissett avatar Jan 17 '23 14:01 MattBlissett

What, if anything, might happen to the outputs from this? https://github.com/gbif/occurrence/blob/master/occurrence-hive/src/main/java/org/gbif/occurrence/hive/udf/ToISO8601UDF.java. That's used here https://github.com/gbif/occurrence/blob/master/occurrence-download/src/main/resources/download-workflow/bionomia/hive-scripts/execute-bionomia-query.q#L36 from which I import into a relational db datetime field.

dshorthouse avatar Jan 19 '23 03:01 dshorthouse

@dshorthouse, I expect ToISO8601UDF to be modified or replaced to support ISO 8601 year-only, year+month-only and ranges.

Since Bionomia has its own download format, we can adjust that format to keep the existing behaviour (take the start of any range, set year/year-month dates to the first of the year/month). Or, we can output values like 2010, 2021-05, 2005-06-08/2005-07-01 and Bionomia can decide how to handle them.

MattBlissett avatar Jan 20 '23 15:01 MattBlissett

Note to myself: as well as Bionomia, the Map of Life custom download format will need consideration.

For the moment, I'm retaining the existing behaviour of using the earliest date.

MattBlissett avatar Sep 19 '23 13:09 MattBlissett

Hi everyone,

Event dates — upcoming API change

Earlier this year we announced a plan to change the way we handle the "eventDate" Darwin Core term. Date ranges formatted using the ISO 8601 standard, recommended by Darwin Core, will retain their meaning, and the API will return values like "2000-05" or "2007-11-13/2007-11-15", rather than the current behaviour of changing these values to "2000-05-01" and "2007-11-13".

These changes are now visible on GBIF's test system, GBIF-UAT.org. To allow time for you to test this change against any existing software and scripts you have, we will not implement these changes on GBIF.org before early November.

API users

Users of the occurrence API will need to decide how to handle an eventDate like "1880/1889", "1910", "2000-05", "1999-11/2000-03", "2007-11-13/2007-11-15" or "2023-09-22T05:17:10/2023-09-22T12:17:10" — taking the earliest, latest or middle value, randomizing within the range, excluding them etc. To make parsing easier ranges will always be formatted using the full form and never abbreviated — always "2007-11-13/2007-11-15" and never "2007-11-13/15".
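As one possible way to implement the "earliest, latest or middle" choice, here is a sketch that expands a full-form eventDate to its first and last instant. It is not part of rGBIF or PyGBIF; the names `expand` and `resolve`, and the choice of 23:59:59 as the end of a day, are assumptions made for the example.

```python
import calendar
from datetime import datetime


def expand(value):
    """Return (earliest, latest) datetimes covered by one ISO value."""
    date_part, _, time_part = value.partition("T")
    if time_part:
        moment = datetime.fromisoformat(value)
        return moment, moment
    fields = [int(p) for p in date_part.split("-")]
    year = fields[0]
    first_month, last_month = (fields[1], fields[1]) if len(fields) > 1 else (1, 12)
    if len(fields) > 2:
        first_day = last_day = fields[2]
    else:
        first_day, last_day = 1, calendar.monthrange(year, last_month)[1]
    return (datetime(year, first_month, first_day),
            datetime(year, last_month, last_day, 23, 59, 59))


def resolve(event_date, strategy="earliest"):
    """Reduce a date or range to one datetime using the chosen strategy."""
    start, separator, end = event_date.partition("/")
    earliest = expand(start)[0]
    latest = expand(end if separator else start)[1]
    if strategy == "earliest":
        return earliest
    if strategy == "latest":
        return latest
    if strategy == "middle":
        return earliest + (latest - earliest) / 2
    raise ValueError(f"unknown strategy: {strategy}")
```

With `strategy="earliest"` this mirrors the old behaviour of taking the first possible day; the other strategies are alternatives a client might prefer.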

It may be easier to use the individual "year", "month" and "day" fields, which will be present only if the year/month/day is constant for the whole range of the eventDate. For example, eventDate=2010-11-25/2010-12-03 will have year=2010, month=NULL, day=NULL, as only the year is constant. (Note, however, that a date like 2022-12-31/2023-01-01 covers just 2 days, but as it spans two different years the "year" field will be blank.)

When searching using a range (e.g. eventDate=2005-01,2005-03) only occurrences with eventDates entirely within the range will be returned.

Download users

The "eventDate" column in CSV, Darwin Core and Parquet (cloud snapshot) downloads will contain the same value as in the API, for example "2023-09-22T12:17:10", "2023-09-22", "1880/1889", "1910", "2000-05", "1999-11/2000-03", "2007-11-13/2007-11-15" or "2023-09-22T05:17:10/2023-09-22T12:17:10".

As with the search API, when filtering using a range (e.g. eventDate=2005-01,2005-03) only occurrences with eventDates entirely within the range will be returned.

Data interpretation (for data publishers)

Eight Darwin Core terms record information on when an occurrence was collected or observed:

  • year
  • month
  • day
  • eventDate
  • eventTime
  • startDayOfYear
  • endDayOfYear
  • verbatimEventDate

Some records will have conflicting information in these fields. Detailed documentation on how we handle the various cases is being prepared, but the general approach is to remove parts of the date that conflict, adding a RECORDED_DATE_MISMATCH issue in this case. For example, "eventDate=2005-06-01", "year=2005", "month=6" and "day=NULL" would have eventDate changed to "2005-06" and the issue added.
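Since the detailed documentation is not yet available, the following is only a guess at how the "remove parts of the date that conflict" approach might look for a single, non-range date. The rule used here (keep the leading eventDate components that agree with the supplied year/month/day) and the function name `reconcile` are assumptions; only the issue name RECORDED_DATE_MISMATCH comes from the text above.

```python
def reconcile(event_date, year=None, month=None, day=None):
    """Return (interpreted_event_date, issues) for a single, non-range date.

    Hypothetical rule: keep the leading eventDate components that agree
    with the supplied year/month/day values and drop the rest.
    """
    supplied = [year, month, day]
    if all(value is None for value in supplied):
        return event_date, []  # nothing to compare against
    fields = [int(p) for p in event_date.split("T")[0].split("-")]
    kept = []
    for field, other in zip(fields, supplied):
        if field != other:
            break
        kept.append(field)
    if len(kept) == len(fields):
        return event_date, []
    widths = [4, 2, 2]
    trimmed = "-".join(f"{v:0{w}d}" for v, w in zip(kept, widths))
    # An empty string means even the year disagreed.
    return trimmed, ["RECORDED_DATE_MISMATCH"]
```

For the example in the paragraph above, `reconcile("2005-06-01", year=2005, month=6)` returns `("2005-06", ["RECORDED_DATE_MISMATCH"])`.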

Occurrences published with only one/some fields will have the other fields filled in automatically, where possible. We will not add an issue flag for this.

Example dataset

A dataset of test occurrences is here: https://www.gbif-uat.org/occurrence/search?dataset_key=d6167827-973d-429a-a00c-8ea294d62d80 providing many examples of consistent and conflicting event date fields. The scientificName is set to a summary of what the eventDate is, and the eventRemarks field has more explanation.

Feedback

Feedback is welcome on this GitHub issue, or the mailing list or Discourse forum.

Thanks,

Matt

MattBlissett avatar Sep 26 '23 11:09 MattBlissett

Thanks Matt, reviewing. Note that some of the eventRemarks in the example dataset look truncated ("Event date (two days"). Maybe a CSV parsing issue?

peterdesmet avatar Sep 26 '23 12:09 peterdesmet