data-infra
GTFS RT validation: Investigate `[Errno 2] No such file or directory` issues
`[Errno 2] No such file or directory` is our most common cause of validation failures. (In the midnight UTC hour on 6/29/23, it occurred for 13 distinct URLs, 4 of them for the entire hour.)
See for example:
```sql
SELECT
  name,
  SUBSTR(validation_exception, 0, 35),
  COUNT(*)
FROM `cal-itp-data-infra.staging.stg_gtfs_rt__vehicle_positions_validation_outcomes`
WHERE dt = '2023-06-29' AND hour = '2023-06-29T00:00:00' AND process_stderr IS NULL AND NOT validation_success
GROUP BY 1, 2
ORDER BY 3 DESC
```
The goal of this ticket is to try to investigate why this is happening.
Per @atvaccaro:
IIRC these are situations where the validator is not writing an output JSON file to the expected path
Still a valid issue.
This came up again today. @evansiroky was asking some questions about B-Line GTFS RT validation and I found that ~1/3 of validation attempts on their data are failing with this exception.
Here's an example looking at two dates:
```sql
SELECT
  feed_type,
  dt,
  SUBSTR(validation_exception, 1, 35),
  COUNT(DISTINCT hour),
  MIN(hour),
  MAX(hour),
  COUNT(*)
FROM `cal-itp-data-infra.staging.int_gtfs_quality__rt_validation_outcomes`
WHERE dt IN ("2023-11-13", "2023-12-11") AND base64_url IN ("aHR0cHM6Ly9wbGFubXlyaWRlLmJsaW5ldHJhbnNpdC5jb20vZ3Rmc3JlYWx0aW1lL0dURlNfVHJpcFVwZGF0ZXMucGI=", "aHR0cHM6Ly9wbGFubXlyaWRlLmJsaW5ldHJhbnNpdC5jb20vZ3Rmc3JlYWx0aW1lL0dURlNfU2VydmljZUFsZXJ0cy5wYg==", "aHR0cHM6Ly9wbGFubXlyaWRlLmJsaW5ldHJhbnNpdC5jb20vZ3Rmc3JlYWx0aW1lL0dURlNfVmVoaWNsZVBvc2l0aW9ucy5wYg==")
GROUP BY 1, 2, 3
ORDER BY 2, 1, 3
```
What I'm particularly confused by is that this implies there are hours where some attempts succeed and others fail, even though we validate at the hour level. So it seems like this error may actually relate to whether we can load the data to be validated, rather than the validator failing to run for that hour as a whole.
Looking at a specific hour of validation for B-Line Alerts, it seems like specifically the attempts to find files associated with the round minute timestamps (00:01:00, 00:02:00, etc.) are failing (though midnight itself succeeds):
```sql
SELECT extract_ts, validation_success, validation_exception, download_response_code
FROM `cal-itp-data-infra.staging.int_gtfs_quality__rt_validation_outcomes`
WHERE dt = "2023-11-13" AND hour = "2023-11-13T00:00:00+00:00" AND base64_url IN ("aHR0cHM6Ly9wbGFubXlyaWRlLmJsaW5ldHJhbnNpdC5jb20vZ3Rmc3JlYWx0aW1lL0dURlNfU2VydmljZUFsZXJ0cy5wYg==")
ORDER BY extract_ts
```
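The round-minute pattern can also be checked offline. Here is a minimal sketch (with synthetic rows, not real query output) of splitting failures by whether the extract timestamp falls exactly on a minute boundary:

```python
from datetime import datetime, timezone

def is_round_minute(ts: datetime) -> bool:
    """True when the extract timestamp falls exactly on a minute boundary."""
    return ts.second == 0 and ts.microsecond == 0

def failure_pattern(rows):
    """Count failing outcomes by round-minute vs off-minute extract_ts.

    `rows` is an iterable of (extract_ts, validation_success) pairs.
    """
    counts = {"round_minute_failures": 0, "off_minute_failures": 0}
    for extract_ts, success in rows:
        if success:
            continue
        if is_round_minute(extract_ts):
            counts["round_minute_failures"] += 1
        else:
            counts["off_minute_failures"] += 1
    return counts

# Synthetic rows mimicking the hour described above: midnight itself
# succeeds, while the 00:01:00 and 00:02:00 extracts fail.
rows = [
    (datetime(2023, 11, 13, 0, 0, 0, tzinfo=timezone.utc), True),
    (datetime(2023, 11, 13, 0, 0, 20, tzinfo=timezone.utc), True),
    (datetime(2023, 11, 13, 0, 1, 0, tzinfo=timezone.utc), False),
    (datetime(2023, 11, 13, 0, 2, 0, tzinfo=timezone.utc), False),
]
print(failure_pattern(rows))  # {'round_minute_failures': 2, 'off_minute_failures': 0}
```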
I am not totally sure what to make of this but this does seem like a bit of a new thread that could be investigated.
For overall prevalence (again using just two dates to sample), it seems like this issue affects about a third of feeds:
```sql
SELECT
  feed_type,
  dt,
  SUBSTR(validation_exception, 1, 35),
  COUNT(DISTINCT base64_url),
  COUNT(DISTINCT hour),
  MIN(hour),
  MAX(hour),
  COUNT(*)
FROM `cal-itp-data-infra.staging.int_gtfs_quality__rt_validation_outcomes`
WHERE dt IN ("2023-11-13", "2023-12-11")
GROUP BY 1, 2, 3
ORDER BY 2, 1, 3
```
And for the affected feeds, it seems like it affects only about 40% of files (on 11/13/2023, 28 feeds were affected but only 49k files: 49k / (28 * 4,320 ≈ 121k) ≈ 40%).
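The back-of-envelope estimate above, reproduced (the 4,320 files/feed/day figure is taken as given from the comment):

```python
# Prevalence estimate for the 2023-11-13 sample.
affected_feeds = 28
files_per_feed_per_day = 4_320     # per the comment above
affected_files = 49_000

expected_files = affected_feeds * files_per_feed_per_day  # 120,960 ≈ 121k
share = affected_files / expected_files

print(f"{share:.1%} of files affected")  # 40.5% of files affected
```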
@ohrite and I are pairing on this issue. So far we found the message is a Python error and could be at this line:
https://github.com/cal-itp/data-infra/blob/968dcfa220ad772e4743da0afd6cd302aa49eba9/jobs/gtfs-rt-parser-v2/gtfs_rt_parser.py#L415
We will keep reading the code to find the problem.
After surface-level investigation, it appears that this error message occurs when the GTFS-RT validator can't open the file, as mentioned above. We recommend developing an automated test suite for this tool to decisively answer this question, ease future debugging, and automate dependency upgrades. What course of action should be taken?
After conversation with @evansiroky, we're going to introduce some additional automated testing to cover these cases as we fix the problem
While writing characterization tests for gtfs-rt-validator-v2, @erikamov and I came across what appears to be a potential source of errors related to thread safety.
During the validate step, gtfs-rt-validator-v2 follows these steps:
- Download a copy of the GTFS schedule `.zip` file from GCS to a temporary directory
- Download a copy of GTFS-RT data from GCS to a subdirectory in the temporary directory
- Run the `gtfs-realtime-validator.jar` against the `.zip` file and GTFS-RT data, which outputs a `.json` file
- Upload the resulting `.json` file to GCS
The gtfs-rt-validator-v2 script launches 4 threads by default to process these archives in parallel. We found that during Step 1, if the GTFS schedule `.zip` file has the same name as another, it will be overwritten. We are concerned that, if this happens in close enough proximity, i.e. before the JVM boots and the `.jar` file reads it, the schedule validated against may be incorrect, resulting in false data.
As an example from the logs:

```
validating /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/rt_e37f5b574b6ad6f74b6c37f675651c88/ with /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/gtfs.zip
executing rt_validator: java -jar None -gtfs /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/gtfs.zip -gtfsRealtimePath /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/rt_e37f5b574b6ad6f74b6c37f675651c88/ -sort name
validating /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/rt_5a98989660a9dfccafb1c3d2556d5fa2/ with /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/gtfs.zip
executing rt_validator: java -jar None -gtfs /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/gtfs.zip -gtfsRealtimePath /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmp_koonpk3/rt_5a98989660a9dfccafb1c3d2556d5fa2/ -sort name
```
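The overwrite hazard, and the obvious fix (a unique per-feed subdirectory), can be sketched like this. The function names are illustrative, not the actual code in `gtfs_rt_parser.py`:

```python
import tempfile
from pathlib import Path

def download_schedule_shared(tmp_dir: Path, content: bytes) -> Path:
    """Hazard: every feed writes to the same tmp_dir/gtfs.zip, so
    concurrent downloads silently overwrite each other."""
    target = tmp_dir / "gtfs.zip"
    target.write_bytes(content)
    return target

def download_schedule_isolated(tmp_dir: Path, content: bytes) -> Path:
    """Fix: give each feed its own unique subdirectory, so identical
    filenames can no longer collide across threads."""
    feed_dir = Path(tempfile.mkdtemp(dir=tmp_dir))
    target = feed_dir / "gtfs.zip"
    target.write_bytes(content)
    return target

tmp = Path(tempfile.mkdtemp())

# Shared path: the second "download" clobbers the first.
x = download_schedule_shared(tmp, b"schedule A")
y = download_schedule_shared(tmp, b"schedule B")
assert x == y and x.read_bytes() == b"schedule B"

# Isolated paths: both schedules survive.
a = download_schedule_isolated(tmp, b"schedule A")
b = download_schedule_isolated(tmp, b"schedule B")
assert a != b and a.read_bytes() == b"schedule A"
```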
When running in the multithreaded configuration, two classes of errors are visible:

- `[Errno 2] No such file or directory` (686 occurrences vs 130 when not multithreaded)
- The following decompression failure, which does not occur when running single-threaded (another variant is visible when
```
[main] INFO org.mobilitydata.gtfsrtvalidator.lib.batch.BatchProcessor - Reading GTFS data from /var/folders/_l/q3hq8y3s17xc6vl_wjk9tgy00000gn/T/tmpanj1jr_x/google_transit.zip...
Exception in thread "main" org.onebusaway.csv_entities.exceptions.CsvEntityIOException: io error: entityType=org.onebusaway.gtfs.model.StopTime path=stop_times.txt lineNumber=360508
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:161)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:171)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:159)
	at org.mobilitydata.gtfsrtvalidator.lib.batch.BatchProcessor.readGtfsData(BatchProcessor.java:315)
	at org.mobilitydata.gtfsrtvalidator.lib.batch.BatchProcessor.processFeeds(BatchProcessor.java:139)
	at org.mobilitydata.gtfsrtvalidator.lib.Main.main(Main.java:62)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
	at java.base/java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:447)
	at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:333)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:376)
	at java.base/sun.nio.cs.StreamDecoder.lockedRead(StreamDecoder.java:219)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:173)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:189)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.base/java.io.BufferedReader.implReadLine(BufferedReader.java:371)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:348)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:437)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:148)
	... 7 more
```
During tests, @ohrite and I found some cases where files are skipped during validation without any message or indication of error. Investigating the code in gtfs-realtime-validator, we found the reason on this line: when the current and previous files have the same hash (i.e., the same content), the validator skips the current file, considering it a duplicate.
Example:

```
[main] INFO org.mobilitydata.gtfsrtvalidator.lib.batch.BatchProcessor - Read feed__2024-10-17T00:20:20Z to byte array in 0.0 seconds
[main] INFO org.mobilitydata.gtfsrtvalidator.lib.batch.BatchProcessor - Read feed__2024-10-17T00:22:40Z to byte array in 0.0 seconds
[main] INFO org.mobilitydata.gtfsrtvalidator.lib.batch.BatchProcessor - Decoded feed__2024-10-17T00:22:40Z protobuf in 0.0 seconds
```
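For illustration, here is a Python sketch of the duplicate-skip behavior as we understand it (a hypothetical reimplementation, not the actual Java code in gtfs-realtime-validator):

```python
import hashlib

def plan_validation(feeds):
    """Mimic the validator's duplicate check: when consecutive files
    (sorted by name) have identical MD5 digests, the later file is
    skipped and no .json output is written for it.

    `feeds` maps filename -> raw protobuf bytes;
    returns (validated, skipped) filename lists.
    """
    validated, skipped = [], []
    prev_digest = None
    for name in sorted(feeds):
        digest = hashlib.md5(feeds[name]).hexdigest()
        if digest == prev_digest:
            skipped.append(name)    # duplicate content: no output file
        else:
            validated.append(name)
        prev_digest = digest
    return validated, skipped

feeds = {
    "feed__2024-10-17T00:20:20Z": b"\x0a\x02hi",
    "feed__2024-10-17T00:21:20Z": b"\x0a\x02hi",   # identical to previous
    "feed__2024-10-17T00:22:40Z": b"\x0a\x02yo",
}
validated, skipped = plan_validation(feeds)
# The middle file produces no output .json, which the Python wrapper
# then reports as "[Errno 2] No such file or directory".
```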
We believe the ZLIB errors are stored in the outcomes table as:

```
Command '['java', '-jar', '/gtfs-realtime-validator.jar', '-gtfs', '/tmp/tmpg53e8q5j/gtfs.zip', '-gtfsRealtimePath', '/tmp/tmpg53e8q5j/rt_8ade09de2d6bd4609c2af42696d37f33/', '-sort', 'name']' returned non-zero exit status 1.
```
Combined with the duplicate MD5s stored as `[Errno 2]`, these constitute an 80% majority of error conditions in the last week.
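Bucketing the outcome rows into these coarse error classes could look like the sketch below (synthetic rows, and bucket names of our own choosing; the prefix matching mirrors the `SUBSTR(validation_exception, ...)` grouping in the queries above):

```python
from collections import Counter

def classify_exception(exc):
    """Bucket a validation_exception string into coarse error classes."""
    if exc is None:
        return "ok"
    if exc.startswith("[Errno 2] No such file or directory"):
        return "missing_output"     # includes duplicate-skipped files
    if "returned non-zero exit status" in exc:
        return "validator_crash"    # e.g. the ZLIB EOFException above
    return "other"

# Synthetic outcome rows (not real data):
outcomes = [
    None,
    "[Errno 2] No such file or directory: '/tmp/.../x.json'",
    "Command '['java', '-jar', ...]' returned non-zero exit status 1.",
    "[Errno 2] No such file or directory: '/tmp/.../y.json'",
    "some unrelated failure",
]
counts = Counter(classify_exception(e) for e in outcomes)
```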
After conversation with @evansiroky, we'll skip duplicates rather than writing [Errno 2] to outcomes.
We are working on adding HTTP recording (VCR) to the pytests so they can run on GitHub Actions.
We merged the change to fix the multithreading issue and reduce the number of errors.
Now, Doc and I are working on skipping duplicates rather than writing `[Errno 2]` to outcomes. We are also still working on adding HTTP recording (VCR) to the pytests so they can run on GitHub Actions.
VCR did not work due to an HTTP 400 error (`Invalid multipart request with 0 mime parts`), so we skipped these tests on CI until we can use GCS on GitHub Actions.
Tests still work well locally. To run them, use: `poetry run pytest --gcs`
After merging the change to fix the multithreading issue (on 11/18/2024), we can see a big decrease in the number of errors.
The first group with BLUE columns shows BEFORE the change; RED and YELLOW show AFTER the change:
After fixing the skipped files that were generating the "Errno 2" errors, we can see an increase in the count_all without errors and a big decrease in errors.
- BLUE columns show BEFORE the change.
- YELLOW shows a secondary error caused by reading invalidly formatted files (e.g. `b'\x0c\x1f\xd7?2y\xd0\x7f\x01$\xc0'`, `b'\x19\x13 F^c\x80\xbd\xfeA\xcf\xa0'`) that we replaced with a friendly message, `No validator output`.
- PINK columns show the current numbers.
Nice work!