gtfs-validator
gtfs-validator copied to clipboard
Deadlock
Bug report
Describe the bug
The validator seems to deadlock sometimes. It seems to be related to out of memory errors.
How we reproduce the bug We were not able to reproduce it in a reliable way.
Screenshots:
If I run strace -s 99
on the process while it's deadlock this is what I get
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=843564252}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=843581123}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=894077086}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=894187471}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] <... futex resumed>) = 0
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=944341980}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] <... futex resumed>) = 0
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=944429322}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>) = 0
[pid 1066759] <... futex resumed>) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=994717560}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=994734541}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>) = 0
[pid 1066759] <... futex resumed>) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=45116513}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=45146218}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066749] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066749] futex(0x7f3d340cec28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066749] getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=4, tv_usec=485045}, ru_stime={tv_sec=1, tv_usec=888009}, ...}) = 0
[pid 1066749] futex(0x7f3d340cec78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=322073722}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>) = 0
[pid 1066759] <... futex resumed>) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=95507682}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=95522138}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>) = 0
[pid 1066759] <... futex resumed>) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=145870548}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=145884474}, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 1066743 detached
The error log is :
{"notices":[{"code":"runtime_exception_in_validator_error","severity":"ERROR","totalNotices":8,"sampleNotices":[{"validator":"org.mobilitydata.gtfsvalidator.validator.ShapeToStopMatchingValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.GtfsStopTimeTripIdForeignKeyValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.LocationHasStopTimesValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.TripUsabilityValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.GtfsStopTimeStopIdForeignKeyValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.TripUsageValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.StopTimeTravelSpeedValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.BlockTripsWithOverlappingStopTimesValidator","exception":"java.lang.NullPointerException","message":""}]},{"code":"thread_execution_error","severity":"ERROR","totalNotices":1,"sampleNotices":[{"exception":"java.lang.OutOfMemoryError","message":"Java heap space"}]}]}
Environment versions
- validator version: v3.0.0
- Java version: openjdk 11.0.13 2021-10-19 OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04) OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
- OS versions: Ubuntu 20.04.3 LTS
Thank you for your reporting a bug. The issue has been placed in triage, the MobilityData team will follow-up on it.
Update : we were able to reproduce every time by forcing a memory limit in a cgroup in linux. Using a limit of 500MB and this feed reproduces the bug every time. Using 100MB, the problem is not reproduced and the process is killed by the kernel.
Using cgroups and a memory limit that's big enough to start the process but big enough to make it run all the rules is enough to reproduce the problem.
Hi Guillaume,
Thanks for reporting this issue! From a brief look, it seems the validator is failing with OutOfMemoryError, I can reproduce it by adding -Xmx500m
java argument.
java -Xmx500m -jar gtfs-validator-v3.0.0_cli.jar -u https://data.bus-data.dft.gov.uk/timetable/download/gtfs-file/south_west/ -o output
I can see that the feed you provided has ~600MB in size. We didn't do heavy memory optimizations and on some bigger feeds 3-6GB should be enough. We would recommend to reserve at least 3GB of memory, this could be done by passing -Xms3g -Xmx3g
arguments.
Could you please provide a bit more context to make sure that we understand the problem correctly?
- What command line do you use to start the validator?
- What do you mean by deadlock? Does the validator get stuck and it must be terminated manually?
- What's the expected behaviour?
@asvechnikov2 yes, the validator deadlocks. It rans out of memory but the problem is actually the deadlock : it hits a point where it's not running anything but never returns or crash. An OutOfMemory crash is actually the expected behaviour.
java -Xmx500m -jar gtfs-validator-v3.0.0_cli.jar -i path_to_gtfs -o output
is what we use.
More context : we use cgroups in our data pipeline to make sure a process doesn't use more memory than it's expected to use. We use historical running data to know how much memory to allocate, so if a feed X used 100MB the last time it ran, we'll use a limit of 100MB + a buffer to avoid slowdown on the server.
To reproduce the bug you really need to use cgroups memory limits of the same size of the -Xmx
argument.
Hi @gcamp !
Thanks for the details. Unfortunately, we were not able to reproduce the exact same behavior using cgroup
+ -Xmx
, where the validator deadlocks. In every test run, the process is killed when a memory problem occurs. Could you provide a code snippet to see if your use of cgroup
differs from ours?
That being said, like you, we noticed that using cgroup
causes problems, so we ran a script to compare cgroup
+ -Xmx
against -Xmx
alone. We ran our tests on a DigitalOcean VM with the following specifications:
- OpenJDK version “11.0.13” 2021-10-19
- OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
- OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
Here are our observations. Our scripts and results can be found in this archive.
- For
cgroup
+-Xmx
: once the memory limit set for thecgroup
is reached, the process is killed and the validator cannot even generate its validator and system reports. - For
-Xmx
: the validator fails due to anOutOfMemoryError
when the memory limit is reached, but is able to generate its reports.
It seems pretty clear that using cgroup
with -Xmx
causes a problem at this point. We define as uncompleted processes all processes where the validator was unable to generate its reports before being killed, including errors other than memory errors.
As the figure above suggests, I suggest you increase the memory limit of your cgroup
to at least 3.5gb when using the validator, or drop cgroup
and use only -Xmx
. We realize that this may not be the best solution for your current use case. Would this be an option for you?
Our understanding is that the problem may be caused by cgroup
itself rather than the validator, but this is still uncertain. Optimizing the validator memory usage is in our backlog, although it is not planned in the short term. If there is a way for you to provide us with more details on how to reproduce the specific deadlock bug, please let us know so we can dig deeper.