gtfs-validator icon indicating copy to clipboard operation
gtfs-validator copied to clipboard

Deadlock

Open gcamp opened this issue 3 years ago • 5 comments

Bug report

Describe the bug

The validator seems to deadlock sometimes. It seems to be related to out of memory errors.

How we reproduce the bug We were not able to reproduce it in a reliable way.

Screenshots:

If I run strace -s 99 on the process while it's deadlock this is what I get

[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=843564252}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=843581123}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=894077086}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=894187471}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=944341980}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] <... futex resumed>)      = 0
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=944429322}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=994717560}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=994734541}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=45116513}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=45146218}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066749] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066749] futex(0x7f3d340cec28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066749] getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=4, tv_usec=485045}, ru_stime={tv_sec=1, tv_usec=888009}, ...}) = 0
[pid 1066749] futex(0x7f3d340cec78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=322073722}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=95507682}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=95522138}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=145870548}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=145884474}, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 1066743 detached

The error log is :

{"notices":[{"code":"runtime_exception_in_validator_error","severity":"ERROR","totalNotices":8,"sampleNotices":[{"validator":"org.mobilitydata.gtfsvalidator.validator.ShapeToStopMatchingValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.GtfsStopTimeTripIdForeignKeyValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.LocationHasStopTimesValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.TripUsabilityValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.GtfsStopTimeStopIdForeignKeyValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.TripUsageValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.StopTimeTravelSpeedValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.BlockTripsWithOverlappingStopTimesValidator","exception":"java.lang.NullPointerException","message":""}]},{"code":"thread_execution_error","severity":"ERROR","totalNotices":1,"sampleNotices":[{"exception":"java.lang.OutOfMemoryError","message":"Java heap space"}]}]}

Environment versions

  • validator version: v3.0.0
  • Java version: openjdk 11.0.13 2021-10-19 OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04) OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
  • OS versions: Ubuntu 20.04.3 LTS

gcamp avatar Dec 20 '21 16:12 gcamp

Thank you for your reporting a bug. The issue has been placed in triage, the MobilityData team will follow-up on it.

github-actions[bot] avatar Dec 20 '21 16:12 github-actions[bot]

Update : we were able to reproduce every time by forcing a memory limit in a cgroup in linux. Using a limit of 500MB and this feed reproduces the bug every time. Using 100MB, the problem is not reproduced and the process is killed by the kernel.

Using cgroups and a memory limit that's big enough to start the process but big enough to make it run all the rules is enough to reproduce the problem.

gcamp avatar Dec 20 '21 19:12 gcamp

Hi Guillaume,

Thanks for reporting this issue! From a brief look, it seems the validator is failing with OutOfMemoryError, I can reproduce it by adding -Xmx500m java argument.

java -Xmx500m -jar gtfs-validator-v3.0.0_cli.jar -u https://data.bus-data.dft.gov.uk/timetable/download/gtfs-file/south_west/ -o output

I can see that the feed you provided has ~600MB in size. We didn't do heavy memory optimizations and on some bigger feeds 3-6GB should be enough. We would recommend to reserve at least 3GB of memory, this could be done by passing -Xms3g -Xmx3g arguments.

Could you please provide a bit more context to make sure that we understand the problem correctly?

  • What command line do you use to start the validator?
  • What do you mean by deadlock? Does the validator get stuck and it must be terminated manually?
  • What's the expected behaviour?

asvechnikov2 avatar Jan 25 '22 00:01 asvechnikov2

@asvechnikov2 yes, the validator deadlocks. It rans out of memory but the problem is actually the deadlock : it hits a point where it's not running anything but never returns or crash. An OutOfMemory crash is actually the expected behaviour.

java -Xmx500m -jar gtfs-validator-v3.0.0_cli.jar -i path_to_gtfs -o output is what we use.

More context : we use cgroups in our data pipeline to make sure a process doesn't use more memory than it's expected to use. We use historical running data to know how much memory to allocate, so if a feed X used 100MB the last time it ran, we'll use a limit of 100MB + a buffer to avoid slowdown on the server.

To reproduce the bug you really need to use cgroups memory limits of the same size of the -Xmx argument.

gcamp avatar Jan 25 '22 01:01 gcamp

Hi @gcamp !

Thanks for the details. Unfortunately, we were not able to reproduce the exact same behavior using cgroup + -Xmx, where the validator deadlocks. In every test run, the process is killed when a memory problem occurs. Could you provide a code snippet to see if your use of cgroup differs from ours?

That being said, like you, we noticed that using cgroup causes problems, so we ran a script to compare cgroup + -Xmx against -Xmx alone. We ran our tests on a DigitalOcean VM with the following specifications:

  • OpenJDK version “11.0.13” 2021-10-19
  • OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
  • OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

Here are our observations. Our scripts and results can be found in this archive.

  • For cgroup + -Xmx: once the memory limit set for the cgroup is reached, the process is killed and the validator cannot even generate its validator and system reports.
  • For -Xmx: the validator fails due to an OutOfMemoryError when the memory limit is reached, but is able to generate its reports. deadlock

It seems pretty clear that using cgroup with -Xmx causes a problem at this point. We define as uncompleted processes all processes where the validator was unable to generate its reports before being killed, including errors other than memory errors.

As the figure above suggests, I suggest you increase the memory limit of your cgroup to at least 3.5gb when using the validator, or drop cgroup and use only -Xmx. We realize that this may not be the best solution for your current use case. Would this be an option for you?

Our understanding is that the problem may be caused by cgroup itself rather than the validator, but this is still uncertain. Optimizing the validator memory usage is in our backlog, although it is not planned in the short term. If there is a way for you to provide us with more details on how to reproduce the specific deadlock bug, please let us know so we can dig deeper.

maximearmstrong avatar Feb 21 '22 23:02 maximearmstrong