gtfs-validator
Flag if a URL listed inside the GTFS dataset doesn't respond/exist
Describe the problem
A user has asked whether this validator could check that the URLs provided in the GTFS dataset (e.g. agency_url, stop_url, etc.) work as intended.
The specification says:
URL - A fully qualified URL that includes http:// or https://, and any special characters in the URL must be correctly escaped. See the following http://www.w3.org/Addressing/URL/4_URI_Recommentations.html for a description of how to create fully qualified URL values.
Although there is no explicit mention that the URL must not return a 404 error, this seems like a very useful addition to the validator that is in line with "fully qualified URL".
Describe the new validation rule
If one of the URL fields in the GTFS dataset returns a 404 error, generate a warning.
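The rule above could be sketched roughly as follows. This is only an illustration in Python, not the validator's actual implementation: the function names, the `url_not_found` notice code, and the notice fields are all hypothetical, and the 5-second timeout is an arbitrary assumption. The status lookup is passed in as a parameter so the rule itself can be exercised without network access.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response was received.

    Assumption: a HEAD request is enough to tell whether the page exists.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failure, timeout, refused connection, malformed URL, etc.
        return None


def check_url(
    filename: str,
    field_name: str,
    url: str,
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> Optional[Dict[str, str]]:
    """Return a WARNING notice if url answers with 404, otherwise None."""
    if fetch(url) == 404:
        return {
            "severity": "WARNING",
            "code": "url_not_found",  # hypothetical notice code
            "filename": filename,
            "fieldName": field_name,
            "url": url,
        }
    # Only 404 is flagged, as proposed; timeouts and other errors are ignored here.
    return None
```

For example, `check_url("agency.txt", "agency_url", url)` would be called once per URL value found in the dataset. Whether a missing response (as opposed to an explicit 404) should also produce a notice is left open in this sketch.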
Sample GTFS datasets
No response
Severity
WARNING
Additional context
No response
I created a PR a few days ago, but the acceptance tests are failing. After analyzing the test, I found that the failure is due to the extra time it takes to validate the URLs. Some of our datasets have thousands of URLs, and it takes approximately 3-4 seconds to validate each one (for 3000 URL entries it adds at least 5 min to the validation time). I don't think we can do any better on validation time. After consulting @davidgamez, we believe we might need to postpone this issue until we have the custom validation profile (also mentioned in #1441), i.e. the URL accessibility check would be an optional notice/validation. We believe this is essential because the check is highly dependent on the user's network and can affect the user experience. Thoughts?
I support delaying the issue until consumers can skip a validation notice. A few points to support it:
- Connectivity on the validator's machine is not a must-have; machines without a connection will get failing notices that cannot be silenced.
- A huge list of URLs to validate will hurt the validator's performance.
- Low network connectivity will hurt the validator's performance.
- Network connection issues can lead to notices being generated.
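
To make the "optional check" idea concrete, here is a minimal Python sketch of an opt-in URL check. Everything in it is an assumption for illustration: the `enabled` flag stands in for the custom validation profile that does not exist yet, `find_broken_urls` and `max_workers` are hypothetical names, and `fetch` is any function that returns an HTTP status code (or None when there is no response). It checks each distinct URL only once and runs the lookups on a small thread pool; whether that meaningfully reduces validation time would depend on the dataset and the network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def find_broken_urls(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]],
    enabled: bool = False,
    max_workers: int = 8,
) -> List[str]:
    """Return the distinct URLs that answer with 404, or [] when the check is off."""
    if not enabled:
        # Check is opt-in: offline or slow machines skip it entirely.
        return []
    # Drop duplicates while keeping first-seen order, so each URL is fetched once.
    unique_urls = list(dict.fromkeys(urls))
    if not unique_urls:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch, unique_urls))
    # A None status (no response) is not reported, only an explicit 404.
    return [url for url, status in zip(unique_urls, statuses) if status == 404]
```

With the flag left at its default, no network requests are made at all, which addresses the offline-machine concern; with it enabled, connection problems surface as None statuses rather than as notices.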