TG2-VALIDATION_EVENTDATE_INRANGE
TestField | Value |
---|---|
GUID | 3cff4dc4-72e9-4abe-9bf3-8a30f1618432 |
Label | VALIDATION_EVENTDATE_INRANGE |
Description | Is the value of dwc:eventDate entirely within the Parameter Range? |
TestType | Validation |
Darwin Core Class | dwc:Event |
Information Elements ActedUpon | dwc:eventDate |
Information Elements Consulted | |
Expected Response | INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is bdq:Empty or if the value of dwc:eventDate is not a valid ISO 8601 date; COMPLIANT if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive, otherwise NOT_COMPLIANT |
Data Quality Dimension | Conformance |
Term-Actions | EVENTDATE_INRANGE |
Parameter(s) | bdq:earliestValidDate, bdq:latestValidDate |
Source Authority | bdq:earliestValidDate default = "1582-11-15"; bdq:latestValidDate default = "{current year}" |
Specification Last Updated | 2024-09-16 |
Examples | [dwc:eventDate="1962-11-01T10:00-0600": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:eventDate is IN_RANGE"] [dwc:eventDate="2300-11-01T10:00": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:eventDate is NOT_IN_RANGE"] |
Source | VertNet |
References | |
Example Implementations (Mechanisms) | Kurator:event_date_qc |
Link to Specification Source Code | FilteredPush event_date_qc DwCEventDQ.validationEventdateInrange() |
Notes | This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this. Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold. |
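The Expected Response above can be read as the following minimal Python sketch. It is not the Kurator or FilteredPush implementation. The function names, the tuple return shape, and the reading of the `{current year}` default as "31 December of the current year" are assumptions made for illustration. The parser is deliberately narrow: it accepts only `YYYY`, `YYYY-MM`, `YYYY-MM-DD` (optionally followed by a time, which is ignored along with any timezone offset) and `start/end` pairs of those forms, so other valid ISO 8601 forms would be wrongly rejected here.

```python
import calendar
import re
from datetime import date

# Hypothetical, simplified grammar: YYYY, YYYY-MM, or YYYY-MM-DD[Ttime].
_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2})(?:T\S+)?)?)?$")


def parse_interval(value):
    """Return the (first_day, last_day) covered by one date string, or None."""
    m = _DATE_RE.match(value)
    if not m:
        return None
    year = int(m.group(1))
    month = int(m.group(2)) if m.group(2) else None
    day = int(m.group(3)) if m.group(3) else None
    try:
        if month is None:
            return date(year, 1, 1), date(year, 12, 31)
        if day is None:
            last = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last)
        d = date(year, month, day)
        return d, d
    except ValueError:  # e.g. month 13 or February 30
        return None


def parse_event_date(value):
    """Return the (start, end) days of a single date or a start/end range."""
    parts = value.strip().split("/")
    if len(parts) > 2:
        return None
    intervals = [parse_interval(p) for p in parts]
    if any(i is None for i in intervals):
        return None
    start, end = intervals[0][0], intervals[-1][1]
    return (start, end) if start <= end else None


def validation_eventdate_inrange(event_date, earliest=date(1582, 11, 15), latest=None):
    """Return (Response.status, Response.result) for one dwc:eventDate value."""
    if latest is None:
        # Assumption: "{current year}" means the end of the current year.
        latest = date(date.today().year, 12, 31)
    if event_date is None or not event_date.strip():
        return "INTERNAL_PREREQUISITES_NOT_MET", None
    interval = parse_event_date(event_date)
    if interval is None:
        return "INTERNAL_PREREQUISITES_NOT_MET", None
    start, end = interval
    if earliest <= start and end <= latest:
        return "RUN_HAS_RESULT", "COMPLIANT"
    return "RUN_HAS_RESULT", "NOT_COMPLIANT"


print(validation_eventdate_inrange("1962-11-01T10:00-0600"))  # ('RUN_HAS_RESULT', 'COMPLIANT')
print(validation_eventdate_inrange("2300-11-01T10:00"))       # ('RUN_HAS_RESULT', 'NOT_COMPLIANT')
```

Passing `earliest` and `latest` explicitly corresponds to setting bdq:earliestValidDate and bdq:latestValidDate for a run.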
Comment by Lee Belbin (@Tasilee) migrated from spreadsheet: Was thinking of adding a lower bound to make it a more comprehensive test, but could we have fossil eventDate?
Needs clarification for eventDate values which are ranges and which span the oldest/youngest boundaries. For example, 1700-01-01/2100-01-10 is an entirely valid eventDate value with a range which includes all likely specimen collecting dates extant, or for some time into the future. Under the current definition, this value (which is in essence a placeholder for "we don't know what the date was") fails the test. Similarly 1650-01/1850-02 would be expected to fail, simply because it places a lower bound to the uncertainty earlier than the default 1700. Framing the test to mark as problems any range which extends outside the 1700-present range will potentially encourage people to frame uncertainty about dates too narrowly, instead of setting reasonable uncertainty values for their situation. I'd prefer to just flag eventDate values which fall entirely outside the specified range. Another potential failure case produced by treating ranges that span the boundaries as problems is an eventDate whose value is the current date, without a time. This is a time interval that extends into the future, and a reasonable implementation of the test as stated would mark any record with an eventDate consisting of the current date without a time as an error, something not desirable when the quality control processes are placed upstream close to initial data capture.
@chicoreus I don't see a problem here - we are not saying it is wrong - just a warning that it is out of range. What is done with that is up to the user, but it flags a possible problem. With annotations - a followup annotation may be that this is OK, because ...
The problem is again one of different interpretations of how to represent uncertainty in eventDate values. A European institution with old collections might very reasonably decide to set 1400-01-01 and 2100-01-10 as end boundaries for any events where the collecting date is not known (the 2100 date making these records very easy to find and distinguish from ones which have had the date narrowed based on some additional interpretation), and would have all of these flagged as problems, binned in with real problem records such as the typical typo 190-10-01. It is very rational from a database perspective to set an end date at some distant future point for all records with uncertainty (this makes them easy to find and collect). I'm not at all in favor of a position that declares that ranges that fall outside the likely bounds are problems. I'd much rather see a narrower test for intervals that entirely fall outside the range of plausible collecting event dates; that should get a much smaller set of false positives and more effectively identify problematic data that needs to be fixed.
The "today's date will fail" issue (today's date to a resolution of one day in an ISO date is a temporal interval that extends into the future, unless special case handling is added for today's date) also makes this test highly problematic for upstream uses near the point of observation.
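The concern about today's date can be illustrated with a small Python sketch. It assumes the upper bound is the instant at which the test is run (the "does not extend into the future" wording) and that a date without a time stands for the whole calendar day. The function names and the fixed `now` value are hypothetical, chosen only to make the example reproducible.

```python
from datetime import date, datetime, time, timedelta


def day_interval(d):
    """Return the [start, end) datetime interval covered by a calendar day."""
    start = datetime.combine(d, time.min)
    return start, start + timedelta(days=1)


def extends_past(d, now):
    """True if the day-resolution date d covers any instant after `now`."""
    _, end = day_interval(d)
    return end > now


# Fixed "now" so the example does not depend on when it is run.
now = datetime(2024, 9, 16, 14, 30)

print(extends_past(date(2024, 9, 16), now))  # True: today runs until midnight
print(extends_past(date(2024, 9, 15), now))  # False: yesterday is entirely past
```

Under these assumptions a record captured today with no time component would be NOT_COMPLIANT unless the implementation special-cases the current day or compares at day resolution.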
I can understand that at the dataset level, but would expect it to be very rare at the record level. The earliest date can be a designated date for the run as well if you need to set an earlier date for some reason - or particular dataset. I don't see it as a big issue.
I'm a simple soul. I side with @ArthurChapman. We have to be careful that we don't erect obstacles that everyone is then forced to climb over. KISS. Others?
Another way of putting the problem I am seeing: treating any range that extends beyond 1700-today as an error conflates two classes of problems: (1) errors in accuracy (e.g. 198-10-15), and (2) broad statements about uncertainty (1500/2100). Broad statements about uncertainty are already captured separately with a measure of event duration. I will argue that it is important to be able to identify the first class of error in isolation, by implementing this test (in the easier way) by flagging records whose range falls entirely outside the range 1700-present. The current statement of the test is more complex, as it raises the specter of special case handling of records with today's date. I also like KISS, and argue that the current description isn't the simple one.
About 10% of the MCZ data has an unknown event date, recorded in the database (which enforces a start and end date as oracle date fields) as 1700-01-01/2100-01-01. From a database perspective, this is a very useful pair - it is very easy to extract those 183136 records on the basis of those values, narrowing by any inference makes these harder to locate as a single sort of data quality issue.
OK, I'll buy it (range outside 1700-present) @chicoreus , but I would like to hear from the rest of the team.
How many institutions do this other than MCZ? It does seem to be a problem. Under your reasoning @chicoreus, we can't only do "not in future". It would appear to me that the field is being used in ways it was never meant to be used, but I can't see any simple way around it other than to remove this test altogether.
Re-examining this validation, I cannot see a problem with flagging a suspicious date (or date range) that is before 1700 or after the day the test is run. A "NOT COMPLIANT" would seem useful information to follow up on. A false positive flag seems preferable to me than a false negative where one end of a range is totally outside 1700-today.
Considering #66, I'd be inclined to include invalid dates (e.g., Feb 30) under this test as they are not in the possible range of dates, and they may well be formatted to ISO standard. This would make this validation dependent on #61.
I'll suggest that we split this test into two separate tests, one of which tests whether or not the event date extends outside the boundaries 1700-present, and the other to test whether or not the event date falls entirely outside the boundaries 1700-present. The first test (crosses out of bounds) may represent problematic data or it may represent a large uncertainty. The second test (falls entirely out of bounds) likely flags data that contains errors (e.g. typos that leave a digit out of the year 190-05-18), but can potentially also flag rare but valid older material, and certain representations of zooarcheological material. This fits a principle of keeping tests simple and focused on particular potential problems.
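The proposed split can be sketched as two separate predicates over an already-parsed interval. This is only an illustration of the distinction: the function names are hypothetical, the upper bound is fixed at 2024-12-31 for reproducibility, and the typo case is represented directly as a Python `date` with year 190 rather than parsed from a string.

```python
from datetime import date


def extends_out_of_bounds(start, end, earliest, latest):
    """True if any part of [start, end] lies outside [earliest, latest]."""
    return start < earliest or end > latest


def entirely_out_of_bounds(start, end, earliest, latest):
    """True if no part of [start, end] lies inside [earliest, latest]."""
    return end < earliest or start > latest


earliest, latest = date(1700, 1, 1), date(2024, 12, 31)

cases = {
    "placeholder 1700-01-01/2100-01-01": (date(1700, 1, 1), date(2100, 1, 1)),
    "typo with year 190": (date(190, 5, 18), date(190, 5, 18)),
    "ordinary 1962-11-01": (date(1962, 11, 1), date(1962, 11, 1)),
}

for label, (start, end) in cases.items():
    print(
        label,
        extends_out_of_bounds(start, end, earliest, latest),
        entirely_out_of_bounds(start, end, earliest, latest),
    )
```

With these bounds the placeholder range is flagged only by the first predicate, the typo by both, and the ordinary date by neither, which is the separation of "large uncertainty" from "likely error" argued for above.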
I have a proposal for a change in the expected response. Instead of
"INTERNAL_PREREQUISITES_NOT_MET if there is no default designated date or the field dwc:eventDate is either not present or is EMPTY; COMPLIANT if the range of dwc:eventDate does not extend into the future and optionally does not extend before a date designated when the test is run, otherwise NOT_COMPLIANT"
I propose
"INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is either not present or is EMPTY; COMPLIANT if no part of the range of dwc:eventDate extends outside optionally-provided begin and end dates; otherwise NOT_COMPLIANT. If no end date is provided, the test should use the current time as an upper bound."
I would also change the Notes. Instead of
"The results of this test are time-dependent. A invalid date for tomorrow will be valid tomorrow. This test provides the option to designate a lower limit to the date, which for specimen records should be 1700-01-01 by convention. (Thus this test has two parameters, a boolean to use or not use a lower bound, and a lower bound, which defaults to 1700-01-01). NB if the parameter is not set, it defaults to 1700-01-01."
I propose
"The results of this test are time-dependent. Today the date for tomorrow is not valid. Tomorrow it will be. This test provides the option to designate lower and upper limits to the date. The upper limit, if not provided should default to the time when the test is run. There should be no default lower limit. NB By convention, use 1700-01-01 as a lower limit for collecting dates of biological specimens."
The proposal from @tucotuco makes sense. We do, however need to specify how the test should behave when the dwc:eventDate is not a valid ISO date. I propose this should be INTERNAL_PREREQUISITES_NOT_MET
@chicoreus #66 checks for the ISO standard in eventDate. If #66 is run prior to #36, then it would have already been covered. This goes back to a workflow of the order of tests.
@ArthurChapman: I agree. This is what we concluded yesterday - that we should not need to re-test for a condition if it had already been tested. And yes, this means workflow dependencies (which we already had).
I agree with @chicoreus. Tests must be defined independent of each other and of any abstract workflow that might use them. Every test must deal appropriately with whatever input it is given.
True that if our recommended workflow order is followed, this test might not be run at all when its internal prerequisites are not met.
@ArthurChapman we must not assume that implementors will run validations in any particular order; indeed, parallelized implementations where the order in which validations are run is non-deterministic are likely at large scale. Also, each test must be able to stand in isolation to be mixed and matched with other core or non-core tests to meet the needs of additional use cases. By imposing assumptions about validation order on the test definitions, we are in effect limiting their utility to only core use cases, not letting them be reused for other needs.
Also, implementors should develop tests in parallel with unit tests of those implementations, and the unit tests should test the behavior of the tests under edge case conditions. Text strings containing non-ISO dates are expected edge cases for all of the tests that take dwc:eventDate as an information element. If we don't define it, the behavior will be undefined: some implementors might make implementations that embed interpretation and return compliant for the same value that other implementors return as non-compliant, and that others return as prerequisites not met. Better to tell implementors what to do in this case, without assumptions about order of tests and the turning on and off of different tests.
Consider the value dwc:eventDate="1820-4-3" and four implementors who handle the format error differently in the test internals.
Implementor 1 tests for iso format, and returns non-compliant before testing range in #36. #66 returns NOT_COMPLIANT #36 returns NOT_COMPLIANT
Implementor 2 tests for iso format, and returns internal prerequisites not met before testing range in #36. #66 returns NOT_COMPLIANT #36 returns INTERNAL_PREREQUISITES_NOT_MET
Implementor 3 parses the string into year/month/day-year/month/day integers, doesn't recognize that the format isn't correct, and ends up testing the range in #36. #66 returns NOT_COMPLIANT #36 returns COMPLIANT
Implementor 4 has a workflow system for the tests that doesn't run downstream tests that have their assumptions not met and never gets to #36. #66 returns NOT_COMPLIANT #36 is not run.
End consumer of the data quality reports is confused.
If we are specific about the handling of the problematic case for this issue then:
All implementors test for iso format, and return internal prerequisites not met before testing range in #36. #66 returns NOT_COMPLIANT #36 returns INTERNAL_PREREQUISITES_NOT_MET
Or, implementors with the workflow system that recognizes test dependencies leave out #36. End users are not confused by some implementations saying their data are compliant and others saying it is not compliant.
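The difference between the lenient behaviour of Implementor 3 and the agreed behaviour can be sketched in Python as follows. Both functions are hypothetical and handle only complete `YYYY-MM-DD` dates; the bounds are fixed values chosen for the example, not the specification defaults.

```python
import re
from datetime import date

# Strict shape check for a complete calendar date (simplifying assumption).
STRICT_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def lenient_range_test(value, earliest, latest):
    """Implementor 3 style: split on '-' and never check the format."""
    year, month, day = (int(part) for part in value.split("-"))
    d = date(year, month, day)
    return "COMPLIANT" if earliest <= d <= latest else "NOT_COMPLIANT"


def strict_range_test(value, earliest, latest):
    """Agreed behaviour: a non-ISO value never reaches the range check."""
    if not STRICT_DATE.match(value):
        return "INTERNAL_PREREQUISITES_NOT_MET"
    try:
        d = date.fromisoformat(value)
    except ValueError:  # correct shape but impossible date, e.g. 2023-02-30
        return "INTERNAL_PREREQUISITES_NOT_MET"
    return "COMPLIANT" if earliest <= d <= latest else "NOT_COMPLIANT"


earliest, latest = date(1700, 1, 1), date(2024, 12, 31)

print(lenient_range_test("1820-4-3", earliest, latest))   # COMPLIANT
print(strict_range_test("1820-4-3", earliest, latest))    # INTERNAL_PREREQUISITES_NOT_MET
print(strict_range_test("1820-04-03", earliest, latest))  # COMPLIANT
```

A unit test asserting the second result for "1820-4-3" is the kind of edge-case check that keeps independent implementations in agreement.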
@chicoreus Accepted. @Tasilee - we must change some of the Expected Response as a result of this decision.
This is one of two TIME tests that has an issue in implementation by the specification. I propose changing the specification from:
INTERNAL_PREREQUISITES_NOT_MET if there is no default designated date or the field dwc:eventDate is EMPTY; COMPLIANT if the range of dwc:eventDate does not extend into the future and optionally does not extend before a date designated when the test is run, otherwise NOT_COMPLIANT
to:
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate does not extend into the future and optionally does not extend before a date designated when the test is run, otherwise NOT_COMPLIANT
@chicoreus. Your change makes sense - we do have a default Parameter date, so the first part of the old wording now makes no sense. I am not sure, however, of the need for the word "optionally", as if there is no date designated, then it defaults to the default date. In some of the other tests we have used words something like "... dwc:eventDate does not extend beyond the Parameter range" or "... Parameter limits". That then caters for the earliest date if the parameter is set or the default, and also caters for the future with bdq:latestValidDate set as current date.
@chicoreus Another thought to bring it more in line with others and to make it positive rather than negative
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is within the parameter range, otherwise NOT_COMPLIANT
@ArthurChapman that latest makes good sense. It makes it clear that an eventDate whose range extends beyond the specified earliest and latest dates is not compliant, that is the eventDate must fall entirely within the range specified by the earliest and latest parameters.
@chicoreus. And it is consistent with what we have done for elevation, depth, etc. where we have used similar wording. I like simplicity and consistency - both of which should aid in coding.
Ditto on simple. I've updated the Expected Response.
Good. However the current definition doesn't match the notes which indicate that the test of the lower bound is optional. This can't be taken from just the specification and the parameters, and this statement exists in only the notes.
Notes simplified: Is @chicoreus happy?
@Tasilee I am happy, don't know if the archeozoological community will be happy. The intent, as I understand it, of making the lower bound optional was to accommodate them. @tucotuco, thoughts?
Can't they simply use -100000-01-01, as "To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver.[19] An expanded year representation [±YYYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign"?
They could, but the "agreed-upon number of extra year digits" is a potential problem, as we would have to specify the number of allowed extra digits, and 6 digits and up to -999,999 might not be enough. Since the parameter is to an api that we are specifying, the prior agreement bit isn't a concern. However, this gets us to whether dwc:eventDate allows for years before 0000, and, given that prior agreement phrasing, I rather suspect it doesn't, dates prior to 0000 would need to go off to the geological age terms under the current phrasing. @tucotuco?
@Tasilee by the way, an event_date_qc implementation without the lower bound being optional is about 1/3 the size and much cleaner to read than an implementation where the lower bound is optional....