
Sporadic issue, possibly with garbage collection

Open · spl opened this issue 2 years ago · 13 comments

We've had several sporadic failures at around the same time in the TerminusDB continuous integration tests with SWI-Prolog 8.5.12.

Several of the failures were in our Rust layer, but one gave us a Prolog stack trace:

% PL-Unit: infer 
SWI-Prolog [thread 2 (gc) at Fri Jun 10 07:57:04 2022]: received fatal signal 11 (segv)
C-stack trace labeled "crash":
  [0] PL_advance_hash_table_enum() at ??:? [0x7ff00172b755]
  [1] PL_advance_hash_table_enum() at ??:? [0x7ff00172b81d]
  [2] PL_advance_hash_table_enum() at ??:? [0x7ff00172b92e]
  [3] killpg() at ??:? [0x7ff00149c090]
  [4] PL_blob_data() at ??:? [0x7ff001776a96]
  [5] PL_thread_self() at ??:? [0x7ff001762372]
  [6] PL_thread_self() at ??:? [0x7ff001766fbe]
  [7] PL_find_blob_type() at ??:? [0x7ff00177d41c]
  [8] PL_atom_generator_w() at ??:? [0x7ff001687f15]
  [9] PL_call_predicate() at ??:? [0x7ff001776b36]
  [10] PL_thread_destroy_engine() at ??:? [0x7ff0016ddc16]
  [11] start_thread() at ??:? [0x7ff001217609]
  [12] clone() at ??:? [0x7ff001578133]
Prolog stack:
  [4] system:garbage_collect_atoms/0 [PC=1 in supervisor]
  [2] $gc:gc_loop/0 [PC=29 in clause 1]
  [0] system:$c_call_prolog/0 [PC=0 in top query clause]
Running on_halt hooks with status 139
Killing 7767 with default signal handlers
Segmentation fault (core dumped)
make[1]: *** [distribution/Makefile.prolog:48: test] Error 139

Even though the failures manifest differently, they all happen around the same point in the unit test run (the infer tests), which could be when GC is triggered.

spl avatar Jun 16 '22 09:06 spl

These are the CI logs:

  • 253: https://github.com/terminusdb/terminusdb/runs/6827049706
  • 254: https://github.com/terminusdb/terminusdb/runs/6840565238
  • 258: https://github.com/terminusdb/terminusdb/runs/6892589239

spl avatar Jun 16 '22 09:06 spl

The C stack trace doesn't match up with the Prolog one. I guess Prolog was compiled without debug symbols or the symbols were stripped? So yes, it is most likely a crash in the atom garbage collector. Note that it may also come from the blob release handlers. The first step is probably to make sure the debug symbols are included so we get a better indication of the crash location.

JanWielemaker avatar Jun 16 '22 10:06 JanWielemaker
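
For context, atom-GC treats blobs as atoms: when an unreferenced blob is reclaimed, the release handler registered in its PL_blob_t is invoked, so foreign code runs on the gc thread inside garbage_collect_atoms/0. The following is only a minimal sketch of such a blob type using the standard SWI-Prolog foreign interface; the builder name and payload are made up for illustration and are not the actual TerminusDB code.

#include <SWI-Prolog.h>
#include <stdlib.h>

typedef struct builder
{ int dummy;                          /* stand-in for the real payload */
} builder_t;

static int
release_builder(atom_t a)
{ size_t len;
  PL_blob_t *type;
  builder_t *b = PL_blob_data(a, &len, &type);

  /* Runs on the gc thread during garbage_collect_atoms/0 once the blob
     atom has no references left.  Freeing native state here while another
     thread still holds the same pointer would crash much like the trace
     above. */
  free(b);
  return TRUE;
}

static PL_blob_t builder_blob =
{ PL_BLOB_MAGIC,
  PL_BLOB_NOCOPY,                     /* the blob stores the pointer itself */
  "builder",
  release_builder,                    /* called from atom-GC */
  NULL,                               /* compare */
  NULL,                               /* write */
  NULL                                /* acquire */
};

A crash inside such a handler is attributed to system:garbage_collect_atoms/0 on the Prolog stack, which is consistent with the trace in the report.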

We're installing on Ubuntu 20.04.4 from the PPA:

sudo apt-add-repository ppa:swi-prolog/devel
sudo apt-get update
sudo apt-get install swi-prolog

spl avatar Jun 16 '22 10:06 spl

I think the default install doesn't include the symbols :cry: You'll have to build from source. The -DCMAKE_BUILD_TYPE=PGO build type does add debug symbols (keeping optimization).

I guess you can also run the relevant CI steps locally, no? If the stack trace gives no clear answer that is probably what needs to happen anyway. Typically stack traces don't help much for debugging real GC issues :cry:

JanWielemaker avatar Jun 16 '22 10:06 JanWielemaker

Part of the problem here is that it doesn't always happen, and when it does happen it tends to happen in slightly different places. So there's no exact replication script here.

If I remember correctly, garbage collection was recently tweaked. Maybe by figuring out what changed we can work back to what's going wrong here. Looking at two of the CI runs that failed here, something appears to go wrong while working with a 'builder', a data structure we use to build up new triple layers, which we refer to in TerminusDB through a Prolog blob. My suspicion is that this blob is simultaneously being released from the gc thread (since the other failure we got appears to happen during garbage_collect_atoms). This obviously shouldn't happen, because the fact that we're calling a function which uses this blob is proof that it is still referred to somewhere.

Was there any change to garbage collection that could explain this sudden change of behavior?

matko avatar Jun 16 '22 10:06 matko
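
The invariant matko relies on is usually enforced on the C side by taking an explicit reference on the blob's atom for as long as native code keeps the pointer around; without that reference, atom-GC is free to call the release handler even though the pointer is still in use. Below is a minimal sketch of that pattern against the standard C API, reusing the builder_blob type from the earlier sketch; the predicate and variable names are illustrative, not the actual terminus_store bindings.

#include <SWI-Prolog.h>

extern PL_blob_t builder_blob;        /* blob type from the earlier sketch */

static atom_t pinned_atom;            /* explicit reference for atom-GC */
static void  *pinned_builder;         /* native pointer taken from the blob */

static foreign_t
pin_builder(term_t t)
{ size_t len;
  PL_blob_t *type;
  atom_t a;

  if ( !PL_get_blob(t, &pinned_builder, &len, &type) ||
       type != &builder_blob ||
       !PL_get_atom(t, &a) )
    return PL_type_error("builder", t);

  /* While this reference is held, atom-GC must not call the release
     handler, so pinned_builder stays valid even if no Prolog term refers
     to the blob any longer. */
  pinned_atom = a;
  PL_register_atom(pinned_atom);

  return TRUE;
}

static foreign_t
unpin_builder(void)
{ PL_unregister_atom(pinned_atom);    /* the blob may be reclaimed again */
  pinned_builder = NULL;

  return TRUE;
}

If the failing code only uses the pointer for the duration of a single foreign call on a term that still references the blob, this registration is not needed, which is why a blob being released while such a call is in progress would point at a GC bug rather than a missing reference.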

Was there any change to garbage collection that could explain this sudden change of behavior?

No. Well, there is a modification to the stream blob release handler that optionally closes abandoned streams, and there is a change to PL_blob_data() to return NULL if the blob has been released. There is a slightly older change that reclaims all atoms on exit, but that only happens when compiled with the Debug or Sanitize build type or when the application explicitly calls PL_cleanup(). That doesn't seem to be the case here, as the gc thread is still running.

GC kicking in while something is still being used is of course a good explanation :smile: I think you know the theory and the tricks to hunt this down?

JanWielemaker avatar Jun 16 '22 11:06 JanWielemaker
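
As a side note on the PL_blob_data() change mentioned above: code that keeps an atom_t to a blob can at least detect that the blob has already been released instead of dereferencing freed memory. A small, purely illustrative sketch:

#include <SWI-Prolog.h>
#include <stdio.h>

static void
use_builder_atom(atom_t a)
{ size_t len;
  PL_blob_t *type;
  void *data = PL_blob_data(a, &len, &type);

  if ( !data )                        /* blob was already released by atom-GC */
  { fprintf(stderr, "builder blob was released behind our back\n");
    return;
  }

  /* ... operate on the builder through data ... */
}

This only detects the problem; it does not fix the underlying race, which is why the registration pattern above (or a fix in GC itself) is still needed.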

I don't know what sort of timeline we're talking about, but I just wanted to add that the TerminusDB tests were failing due to other problems before 8.5.12, so we don't know how far back in the sequence of 8.5 releases the problem might have been introduced.

spl avatar Jun 16 '22 11:06 spl

This looks like an old bug. I've been stressing atom-GC a lot recently, and there have not been changes to atom-GC for a long time. The depth of the C stack hints at a crash inside a blob release handler (but that is not certain). I'd first get a proper stack trace. That should tell us whether we are in a release handler and give at least a hint of what the problem could be.

JanWielemaker avatar Jun 16 '22 11:06 JanWielemaker

FYI, the PPA now provides a .deb with the debug symbols. See https://www.swi-prolog.org/build/PPA.html for installation instructions.

JanWielemaker avatar Jun 22 '22 11:06 JanWielemaker

Something similar also happens to us when executing rather complex procedures that only make use of standard objects (lists, dicts, ...), built-in predicates (findall, foldl, aggregate, ...) and an RDF database. The problem occurs fairly consistently on the first execution of the predicate immediately after starting the top-level Prolog session, consulting the sources and loading the RDF database. If the crash doesn't happen in this situation, it doesn't happen at all afterwards (a problem with the initial indexing?). Usually, during the execution of the predicate in the above situation, this happens:

SWI-Prolog [thread 2 (gc) at Wed Sep 14 09:41:12 2022]: received fatal signal 11 (segv)
C-stack trace labeled "crash":
Prolog stack:
  [4] system:garbage_collect_atoms/0 [PC=1 in supervisor]
  [2] $gc:gc_loop/0 [PC=29 in clause 1]
  [0] system:$c_call_prolog/0 [PC=0 in top query clause]
Running on_halt hooks with status 139
Killing 7555 with default signal handlers
Segmentation fault: 11

Around the time of the crash I have these values:

statistics(atom_space, 120685251)
statistics(c_stack, 8388608)

and the standard threads: main, gc, __rdf_GC.

Other times, after the execution has finished, this has happened at the Prolog prompt:

?- 
SWI-Prolog [thread 2 (gc) at Tue Sep 13 18:24:37 2022]: received fatal signal 4 (ill)
C-stack trace labeled "crash":
  [0] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(save_backtrace+0xae) [0x10355024e]
  [1] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(sigCrashHandler+0xdb) [0x10355091b]
  [2] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(dispatch_signal+0x25f) [0x1034a706f]
  [3] /usr/lib/system/libsystem_platform.dylib(_sigtramp+0x1d) [0x7fff20622d7d]
  [5] /usr/lib/system/libsystem_pthread.dylib(pthread_mutex_destroy+0x2f) [0x7fff205d9956]
  [6] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(destroy_indirect_table+0x1d) [0x1034eeb3d]
  [7] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(release_trie_ref+0x59) [0x1034ed699]
  [8] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(pl_garbage_collect_atoms+0x333) [0x10341d923]
  [9] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(PL_next_solution___LD+0xfa94) [0x1034305f4]
  [10] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(PL_call_predicate+0x28) [0x103516a18]
  [11] /Applications/SWI-Prolog.app/Contents/Frameworks/libswipl.8.5.16.dylib(GCmain+0x12d) [0x1034c8aad]
  [12] /usr/lib/system/libsystem_pthread.dylib(_pthread_start+0xe0) [0x7fff205dd8fc]
  [13] /usr/lib/system/libsystem_pthread.dylib(thread_start+0xf) [0x7fff205d9443]
Prolog stack:
  [4] system:garbage_collect_atoms/0 [PC=1 in supervisor]
  [2] $gc:gc_loop/0 [PC=29 in clause 1]
  [0] system:$c_call_prolog/0 [PC=0 in top query clause]
Running on_halt hooks with status 132
Killing 761 with default signal handlers
Illegal instruction: 4

It should be noted that with version 8.5.8 and earlier the problem does not occur; I cannot say for 8.5.9 because I don't have it; from 8.5.10 onward it happens almost always.

Bartola64 avatar Sep 14 '22 08:09 Bartola64

Doesn't look related. The original problem seems to be with atom-GC itself, whereas this seems to be a memory management issue with tables that are reclaimed as a result of atom-GC. That is, if the stack is correct, which is not always the case on macOS. The stack does look plausible though.

Checking https://www.swi-prolog.org/ChangeLog?branch=development&from=8.5.8&to=8.5.10 we see quite some fixes to leaking memory. That can be consistent with the report. Without code to reproduce it, this will be hard to fix though. The way to find this is to compile Prolog with AddressSanitizer support and run the offending code.

JanWielemaker avatar Sep 14 '22 09:09 JanWielemaker

Thanks for the reply. Do you suggest opening a new issue?

Bartola64 avatar Sep 14 '22 09:09 Bartola64

Do you suggest opening a new issue?

Yes. Unless I'm missing something in the GitHub interface, it seems impossible to create a new issue from a comment on an existing unrelated issue :cry:

JanWielemaker avatar Sep 14 '22 09:09 JanWielemaker

Original problem seems fixed. Closing.

JanWielemaker avatar Apr 19 '23 15:04 JanWielemaker