gap startup error

Open abetten opened this issue 1 year ago • 19 comments

Observed behaviour

[MacBook-Pro:~/SOFT.24] betten% gap
 ┌───────┐   GAP 4.12.2 of 2022-12-18
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: aarch64-apple-darwin22-default64-kv8
 Configuration:  gmp 6.2.1, GASMAN
 Loading the library and packages ...
Syntax warning: Unbound global variable in /Users/betten/SOFT.23/gap-4.12.2/lib/primality.gi:515
  Np:=(N-1)/p;
  ^^

Expected behaviour

Copy and paste GAP banner (to tell us about your setup)

abetten avatar Feb 18 '24 07:02 abetten

Thanks for the report @abetten, are you installing from the git repo or from the release archive? I seem to remember seeing something similar but don't remember what the issue was.

james-d-mitchell avatar Feb 19 '24 19:02 james-d-mitchell

The funny thing is, this is an existing installation that was working before. I do not know if I changed a path or anything like that, but suddenly I get this error message. It would be nice to trace down the exact reason for it. I remember having seen it before, but I cannot remember what I did back then.

abetten avatar Feb 19 '24 19:02 abetten

Release archive or from git?

james-d-mitchell avatar Feb 19 '24 20:02 james-d-mitchell

This is really weird. Can you please tell us the output of

shasum -a 256 /Users/betten/SOFT.23/gap-4.12.2/lib/primality.gi

If the reported checksum differs from

2e598e50b5823f6c8b02d34b83e204e89f91bcb19521dc2e7d56d12205708c80

then maybe upload that file to this issue (you may have to add '.txt' as extension for that).

fingolfin avatar Feb 19 '24 22:02 fingolfin

Just one quick check, is there a 'gap' in SOFT.24, and does it work if you just run GAP from SOFT.23? I notice you are picking up the 'gap' library from SOFT.23 in the SOFT.24 directory. Just wanted to check if there is some very strange path issue going on (even if there is, GAP shouldn't behave like this).

ChrisJefferson avatar Feb 21 '24 05:02 ChrisJefferson

Can't reproduce and no further communication by issue submitter. So closing. If the issue reappears with GAP 4.13.0 (which will be released tomorrow), feel free to re-open.

fingolfin avatar Mar 15 '24 01:03 fingolfin

Just to say that I am now seeing this in the CI jobs of the Semigroups package at, for example:

https://github.com/semigroups/Semigroups/actions/runs/9155759303/job/25168765002?pr=1012

Maybe there's something wrong in the setup of the CI, but this is still an unhelpful way to indicate that. @fingolfin @ChrisJefferson

james-d-mitchell avatar May 20 '24 09:05 james-d-mitchell

No, now I can see the full output, yours looks at first glance to me like good old memory corruption -- you can see it get very upset at the end. Of course, I have no idea why it is happening here and not elsewhere (although the common fact is macs)...

I really don't want to start trying to do debugging via github action...

ChrisJefferson avatar May 20 '24 09:05 ChrisJefferson

Just to be clear, the end of the log looks like:

Syntax warning: Unbound global variable in /Users/runner/gap/lib/primality.gi:\
527
  return [true,a];
               ^
Error, Length: <list> must be a list (not the integer 1518809461113931223)�v�|

Which looks to me like some object corruption has occurred. Hypothetically this could be the fault of Semigroups, but I'm tempted to say not, because GAP is currently parsing primality.gi, which happens before we reach that point.

My best guess is that Apple has done something which is messing up how gasman marks bags, but it could be any weird compiler thing really.

ChrisJefferson avatar May 20 '24 09:05 ChrisJefferson

Could someone with an up-to-date Mac try building stable-4.12 with 'TREMBLE_HEAP' enabled? You could just go into gasman.c and remove the #ifdef TREMBLE_HEAP guards in the two places they appear around "CollectBags(0,0)", then try building GAP and running it.

Note this is one of those "gosh GAP is going to take a long time to do anything, even start" type options, so only do it on a machine where you don't mind the fan spinning up for quite a long time (could be hours!)

ChrisJefferson avatar May 20 '24 09:05 ChrisJefferson

Thanks @ChrisJefferson I'll try what you suggested just now.

james-d-mitchell avatar May 20 '24 09:05 james-d-mitchell

I just tried what you suggested @ChrisJefferson using the release archive of GAP 4.12.2, and this doesn't seem to reproduce the error:

❯ ./gap -A
 ┌───────┐   GAP 4.12.2 of 2022-12-18
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: aarch64-apple-darwin22-default64-kv8
 Configuration:  gmp 6.2.1, GASMAN
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.6, PrimGrp 3.4.4, SmallGrp 1.5.3, TransGrp 3.6.5
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
gap>

Here's the config.log file:

config.log

james-d-mitchell avatar May 20 '24 10:05 james-d-mitchell

Not sure that my mac counts as "up to date", unfortunately, it's an M1 from 2021 IIRC.

james-d-mitchell avatar May 20 '24 10:05 james-d-mitchell

Thanks. I'm going to try poking a bit on my PC and see if I can shake anything out.

ChrisJefferson avatar May 20 '24 11:05 ChrisJefferson

Just to mention that this doesn't seem to occur with GAP 4.13:

https://github.com/semigroups/Semigroups/actions/runs/9187837589/job/25266363188?pr=1012

So it is perhaps resolved already.

james-d-mitchell avatar May 22 '24 08:05 james-d-mitchell

By sshing into GitHub Actions on stable-4.12 in the Semigroups CI, I managed to catch this error.

After a lot of debugging, I have tracked the problem down, I think, to gmp.

__gmpz_mul seems to be writing to a memory location it shouldn't. The memory it writes to isn't allocated yet, so this doesn't cause a problem in most cases. However, the string allocation code assumes the memory it uses will be zeroed, so when writing a string it doesn't bother with a null terminator. We then end up with local variables with silly names like pfdjifdsjio (instead of p), which is what causes the "Unbound global variable" warning.

The actual error occurs here:

    frame #0: 0x0000000101f0ebe8 libgmp.10.dylib`__gmpn_mul_1c + 200
  * frame #1: 0x0000000101f05228 libgmp.10.dylib`__gmpz_mul + 160
    frame #2: 0x000000010085a9c0 gap`ProdInt(opL=0x00001000003c8130, opR=0x0000000017d78401) at integer.c:1471:3 [opt]
    frame #3: 0x000000010085a54c gap`IntStringInternal(string=0x0000000000000000, str="84128410784489288223092474348389603623030322640088442936747974518239642507631380108010588884252565717918682347709584444173260730941561211749732512257059040264927466644819174048875651367892940295977531020921450283370778464844131921016112826112511277611411962047115457979770639907893271757547513348734936139234492934084356041841547537781640044258066541550710400764797315999285813") at integer.c:1087:19 [opt]
    frame #4: 0x0000000100867bfc gap`IntrIntExpr(intr=0x000000016f5cad80, string=0x0000000000000000, str="84128410784489288223092474348389603623030322640088442936747974518239642507631380108010588884252565717918682347709584444173260730941561211749732512257059040264927466644819174048875651367892940295977531020921450283370778464844131921016112826112511277611411962047115457979770639907893271757547513348734936139234492934084356041841547537781640044258066541550710400764797315999285813") at intrprtr.c:1794:15 [opt]
    frame #5: 0x00000001008f553c gap`ReadLiteral(rs=0x000000016f5ca930, follow=18446744072694563073, mode='r') at read.c:1520:27 [opt]

That huge number occurs in primality.gi. Parsing it is what triggers the memory corruption, which is why people see a bug a little later in primality.gi, if they see a bug at all.

The question (which I don't yet know the answer to) is: is this a problem with linking the wrong libgmp (we seem to be linking to GAP's internal libgmp in this case), or is our 'fakegmp' somehow messed up? It's hard to tell what's going on inside libgmp due to a lack of debug symbols.

ChrisJefferson avatar May 22 '24 13:05 ChrisJefferson

Some data dumping:

The memory location incorrectly written to is: 0x100007341fc0

The 3 mpzs passed to mpz_mul by ProdInt are:

(lldb) print mpzResult
(fake_mpz_t) {
  [0] = {
    v = {
      [0] = {
        _mp_alloc = 11
        _mp_size = 0
        _mp_d = 0x00001000072e3fe0
      }
    }
    tmp = 6163310904
    obj = 0x00001000003c8138
  }
}
(lldb) print mpzL
(fake_mpz_t) {
  [0] = {
    v = {
      [0] = {
        _mp_alloc = 10
        _mp_size = 10
        _mp_d = 0x00001000072e3f78
      }
    }
    tmp = 8412841078448928
    obj = 0x00001000003c8130
  }
}
(lldb) print mpzR
(fake_mpz_t) {
  [0] = {
    v = {
      [0] = {
        _mp_alloc = 1
        _mp_size = 1
        _mp_d = 0x000000016f5c9908
      }
    }
    tmp = 100000000
    obj = NULL
  }
}

ChrisJefferson avatar May 22 '24 13:05 ChrisJefferson

GMP's own make check is segfaulting all over the place, so I wonder if it's just that GMP 6.2.1, which is what is in 4.12, doesn't support Macs properly (we have 6.3 in the latest release).

A feature of 6.3 is "Support for 64-bit Arm under Macos. "

ChrisJefferson avatar May 22 '24 13:05 ChrisJefferson

Well done, Chris!! Excellent work.

Best, Anton


abetten avatar May 22 '24 13:05 abetten