Oscar.jl

Error on 1.10 ubuntu long

Open thofma opened this issue 1 year ago • 22 comments

If one looks at https://github.com/oscar-system/Oscar.jl/commits/master/, one sees that often "Run tests / test (~1.10.0-0, long, ubuntu-latest) (push)" fails. The error looks scary, e.g. in https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4952 and https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:26094:

!!! ERROR in jl_ -- ABORTING !!!

Does anyone have an idea where that might be coming from? I have not tried to reproduce it locally. It does not look like https://github.com/oscar-system/Oscar.jl/issues/2441.

CC: @lgoettgens @benlorenz

thofma avatar Jan 12 '24 16:01 thofma

No idea.

lgoettgens avatar Jan 12 '24 16:01 lgoettgens

This is some weird GC corruption that seems to occur while the Serialization/IPC tests run. It appears to be related to Julia tasks, but I haven't been able to reproduce it locally. I have the long testset running in a loop under rr to trigger and capture this (currently at about 100 iterations).

So far I got only one other crash, but in the test group elliptic_surface.jl, which runs before the IPC stuff:
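For anyone who wants to try the same kind of loop, here is a minimal sketch of "run until it fails". The helper name `first_failure` is made up for illustration, and the thread does not give the exact command that was looped, so any command you pass in (for example one wrapping the test run in `rr record`) is an assumption.

```julia
# Hypothetical helper (not part of Oscar.jl): run `cmd` repeatedly and
# report the index of the first iteration that fails.
function first_failure(cmd::Cmd; max_iters::Integer=100)
    for i in 1:max_iters
        # Forward the child's output so a crash message ends up in our log.
        p = pipeline(cmd; stdout=stdout, stderr=stderr)
        # `success` runs the command and returns true only if it exited
        # with status 0 and was not killed by a signal.
        success(p) || return i
    end
    return nothing  # no failure within `max_iters` runs
end
```

A return value of `nothing` only means the crash did not show up within `max_iters` runs, which for a rare crash is not strong evidence of anything.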

[4832] signal (11.1): Segmentation fault
in expression starting at /home/datastore/lorenz/software/julia/Oscar.jl/test/AlgebraicGeometry/Schemes/elliptic_surface.jl:1
jl_object_id__cold at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:455
type_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1575
typekey_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1605
jl_precompute_memoized_dt at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1685
inst_datatype_inner at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2081
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2176
arg_type_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2232 [inlined]
jl_lookup_generic_ at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3020 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3072
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:834
unknown function (ip: 0x1522095c16a5)
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:216
unknown function (ip: 0x1522095c11c9)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#197 at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:92
unknown function (ip: 0x1522095c141c)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_normal_value at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:152
unknown function (ip: 0x1522095c1336)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:223
unknown function (ip: 0x1522095c11c9)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
low_level_caller_rng at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:378
minAssGTZ at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/Meta.jl:45
unknown function (ip: 0x1522095c0389)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#minimal_primes#335 at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:830
minimal_primes at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:818 [inlined]
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1255
#356 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479    
get_attribute! at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:230 [inlined]
is_prime at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1254
unknown function (ip: 0x1521620272f5)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpolyquo-localizations.jl:1853
#914 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479                    
unknown function (ip: 0x152162026da0)    

benlorenz avatar Jan 12 '24 17:01 benlorenz

typeinf_local and deserialize occur in the backtrace, so perhaps this is related to type inference during deserialization, such as the compiler getting an unexpected type. I can imagine something like this happening in deserialization, but why only in this test?

jankoboehm avatar Jan 12 '24 18:01 jankoboehm

The same crash as described by @benlorenz happened also in the corresponding test run for #3018 after the changes that were pushed yesterday.

ThomasBreuer avatar Jan 15 '24 08:01 ThomasBreuer

The second backtrace reported here by @benlorenz involves Singular.jl and the primdec library function minAssGTZ -- specifically the code in Singular.jl which converts its return value to Julia. Maybe there is a GC.@preserve missing there or some other bug. Perhaps it causes a memory corruption and then triggers the second crash, too... even if it does not, that needs to be solved.
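To illustrate the general pattern (this is not the Singular.jl code, and `sum_via_pointer` is a made-up name): once only a raw pointer to an object is in use, the GC is free to collect the object unless it is explicitly kept alive with `GC.@preserve`.

```julia
# Hypothetical example of the GC.@preserve pattern; not taken from Singular.jl.
function sum_via_pointer(v::Vector{Float64})
    total = 0.0
    # Keep `v` rooted for as long as we read through the raw pointer.
    # Without this, the compiler may consider `v` dead after `pointer(v)`.
    GC.@preserve v begin
        p = pointer(v)
        for i in 1:length(v)
            total += unsafe_load(p, i)
        end
    end
    return total
end
```

Bugs of this kind tend to show up only when a collection happens to run in the unprotected window, which would fit how rarely these crashes occur.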

fingolfin avatar Jan 17 '24 11:01 fingolfin

After digging into the first backtrace again, this is a GC corruption error (https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4949), so this could be due to the same issue.

lgoettgens avatar Jan 17 '24 11:01 lgoettgens

I have a preliminary fix for the crash I reported (jl_object_id__cold) here: https://github.com/oscar-system/Singular.jl/pull/749. This adds a missing GC protection in the libsingular_julia code for passing data from a sleftv back to Julia. I want to do some further testing now; unfortunately (for me at least ...) these crashes are rather rare.

benlorenz avatar Jan 18 '24 15:01 benlorenz

The original error (GC error (probable corruption)) also happens on macOS, observed during my flint 2.9 backport testing: https://github.com/benlorenz/Oscar.jl/actions/runs/7583365592/job/20655088309#step:9:4981 (but even less often than on Ubuntu)

benlorenz avatar Jan 19 '24 12:01 benlorenz

Another occurrence (on macOS): https://github.com/oscar-system/Oscar.jl/actions/runs/7585991305/job/20663114060?pr=3213

joschmitt avatar Jan 22 '24 12:01 joschmitt

In both recent occurrences, the crash happened shortly after we see

Testing test/AlgebraicGeometry/Schemes/elliptic_surface.jl [...]

which I think means it is probably in the middle of testing test/Serialization/IPC.jl? (There is no message "Starting tests for ..." before that, perhaps we could add such a message?)

fingolfin avatar Jan 24 '24 09:01 fingolfin

Specifically, if we add a "Starting tests..." message before loading IPC.jl, and also force a full GC before that message, then perhaps we can get a better idea as to whether the corruption happens before IPC.jl, or during it?
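A minimal sketch of what that could look like, assuming a small wrapper around `include`; the name `include_with_marker` is hypothetical and not part of Oscar's test suite:

```julia
# Hypothetical helper: force a full GC, print a marker, then load the test file.
function include_with_marker(path::AbstractString)
    GC.gc(true)   # full collection, so earlier corruption has a chance to surface here
    println("Starting tests for ", path)
    flush(stdout) # make sure the marker reaches the log before any crash
    # Evaluate the file at top level in Main (an assumption; a real test
    # suite would probably include into its own test module).
    return Base.include(Main, path)
end
```

If the GC error then shows up before the marker, the corruption predates the file being loaded; if it shows up after, it is at least detected during that file.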

fingolfin avatar Jan 24 '24 09:01 fingolfin

I can add the message, but I would like to hold off a bit on adding something like an explicit GC for now, since we just started running the tests with libsingular_julia 0.40.11, which is the first version including my sleftv fix. (At least until we see another error with that version...)

benlorenz avatar Jan 24 '24 09:01 benlorenz

It still happens with the new libsingular, and even with the explicit GC call it happens within the IPC.jl tests: https://github.com/oscar-system/Oscar.jl/actions/runs/7638814195/job/20810486432?pr=3229#step:8:4959. Unfortunately, I haven't been able to reproduce this crash outside of GitHub Actions. I have two jobs running the long testsuite with 300 successful iterations so far.

benlorenz avatar Jan 24 '24 13:01 benlorenz

Also happened here: https://github.com/oscar-system/Oscar.jl/actions/runs/7638161558/job/20808482695?pr=3226

Could it be that it can again only be reproduced on a memory-starved machine, with 7-8 GB RAM?

fingolfin avatar Jan 24 '24 13:01 fingolfin

The workers should be less memory starved now, they were recently upgraded to have 4 CPUs and 16 GB of memory.

benlorenz avatar Jan 24 '24 13:01 benlorenz

A recent crash is reported at https://github.com/oscar-system/Oscar.jl/actions/runs/7642653452/job/20822790076?pr=3236

ThomasBreuer avatar Jan 24 '24 21:01 ThomasBreuer

I have opened a PR to disable the IPC test for now while I try to debug this further: https://github.com/oscar-system/Oscar.jl/pull/3246

benlorenz avatar Jan 25 '24 23:01 benlorenz

And here is an instance of the crash with Julia 1.9: https://github.com/oscar-system/Oscar.jl/actions/runs/7665378425/job/20891166477?pr=3247

fingolfin avatar Jan 26 '24 09:01 fingolfin

Thanks for noticing. That is interesting: it turns out that doing GC.gc() before the IPC.jl tests seems to increase the rate at which the error occurs. (But still only on GitHub Actions so far ...) Maybe that helped trigger this on 1.9 as well.

benlorenz avatar Jan 26 '24 09:01 benlorenz

Our CI looks a lot better now without the IPC.jl tests, which should help with development. But I am continuing to look into this. Please post any further errors you notice in the CI.

I just found this one during QuadFormAndIsom, unfortunately without any backtrace:

Sat, 27 Jan 2024 14:58:45 GMT GC: pause 27.39ms. collected 39.011118MB. incr 
Sat, 27 Jan 2024 14:58:45 GMT corrupted double-linked list
Sat, 27 Jan 2024 14:58:45 GMT
Sat, 27 Jan 2024 14:58:45 GMT [1921] signal (6.-6): Aborted
Sat, 27 Jan 2024 14:58:45 GMT in expression starting at /home/runner/work/Oscar.jl/Oscar.jl/experimental/QuadFormAndIsom/test/runtests.jl:269
Sat, 27 Jan 2024 17:09:25 GMT Error: The operation was canceled.

from https://github.com/oscar-system/Oscar.jl/actions/runs/7679187557/job/20929824694?pr=3212#step:8:1790

benlorenz avatar Jan 28 '24 12:01 benlorenz

After some more debugging I found that the error will quite surely be gone once 1.10.1 is released, fixed via JuliaLang/julia@8a04df0 (#52755). I don't really know why this happens so much more on 1.10, but it is probably due to the more aggressive GC.

In this workflow I have about 150 successful runs of the long group including the IPC.jl tests, with an intermediate julia build from the backports-release-1.10 branch.

So once that is released I will try to reactivate these tests and hopefully close this ticket.
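One possible way to reactivate the tests only where the fix is present would be a version gate like the sketch below. The function name `ipc_tests_enabled` is hypothetical, and note that this would also skip the tests on 1.9, where one crash was reported above.

```julia
# Hypothetical gate: 1.10.1 is the release expected in this thread to
# contain the fix; everything older is treated as affected.
ipc_tests_enabled(v::VersionNumber=VERSION) = v >= v"1.10.1"
```

In a test runner this could guard the `include` of the IPC test file, e.g. `ipc_tests_enabled() && include("Serialization/IPC.jl")`, where the path is assumed from the file names mentioned in this thread.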

benlorenz avatar Jan 31 '24 11:01 benlorenz

This is back: https://github.com/Nemocas/Nemo.jl/actions/runs/8546742962/job/23417708965?pr=1700

(This downstream test run only checks Oscar.)

thofma avatar Apr 04 '24 05:04 thofma