GTFS RT validation models should have GTFS validator version for all rows
As a transit data quality analyst, I want to know what GTFS validator version was run on a given date so that if/when we eventually upgrade the validator it will be clear what version was being used at what time.
Currently, validator version is only populated if notices were actually present for the given code: https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_quality/fct_daily_rt_feed_validation_notices.sql#L36.
I think that we will need to rearchitect this a little bit to be more similar to how the GTFS schedule validation is handled, where validator version is configured at the date level (https://github.com/cal-itp/data-infra/blob/main/jobs/gtfs-schedule-validator/gtfs_schedule_validator_hourly.py#L151-L162).
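To make the date-level idea concrete, here is a minimal sketch of a lookup that returns exactly one validator version for any date. It is not the code in the linked schedule validator job; the name `VERSION_CUTOVERS`, the function `validator_version_for`, and the dates and version strings are all placeholders invented for illustration.

```python
from datetime import date

# Hypothetical cutover table: (first date the version applies, version string).
# These dates and version strings are placeholders, not real release history.
VERSION_CUTOVERS = [
    (date(2023, 1, 1), "v1.0.0"),
    (date(2023, 6, 1), "v2.0.0"),
]


def validator_version_for(day: date) -> str:
    """Return the single validator version in effect on `day`."""
    current = None
    # Entries are assumed to be sorted by start date, ascending.
    for start, version in VERSION_CUTOVERS:
        if day >= start:
            current = version
    if current is None:
        raise ValueError(f"no validator version configured for {day}")
    return current
```

With something like this, the version depends only on the date, so it can be attached to every row whether or not any notices were produced.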
I think we are using the `array_agg` approach to handle potential cases where different validator versions are used within a single date. We should probably instead commit to upgrading the validator in such a way that only one version is used per day, make the validator version a string rather than an array, and populate it in a more guaranteed fashion.
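As a rough illustration of the array-to-string change, the sketch below collapses per-row version arrays into one string per date and applies it to every row for that date, including rows with no notices. The row shape and the field names (`date`, `code`, `total_notices`, `validator_version`) are assumptions for the example, not the actual columns of the mart model.

```python
from collections import defaultdict


def fill_validator_version(rows):
    """Replace each row's version array with the single version seen on its date.

    `rows` is a list of dicts with hypothetical keys: 'date', 'code',
    'total_notices', and 'validator_version' (a list of strings, or None).
    """
    # Collect every version observed on each date.
    seen = defaultdict(set)
    for row in rows:
        for version in row.get("validator_version") or []:
            seen[row["date"]].add(version)

    filled = []
    for row in rows:
        versions = seen[row["date"]]
        if len(versions) > 1:
            # Enforces the "one validator version per day" commitment.
            raise ValueError(f"multiple validator versions on {row['date']}: {sorted(versions)}")
        # None if no row on this date carried a version at all.
        version = next(iter(versions), None)
        filled.append({**row, "validator_version": version})
    return filled
```

Note that this still yields no version on a date where no notices were present for any code, which is why configuring the version at the date level is the more guaranteed option.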
The schedule validation notices mart model has some relevant logic: https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_quality/fct_daily_schedule_feed_validation_notices.sql