
G10k hanging at control_repo branch iteration

Open · nickTh-tyn opened this issue 10 months ago · 36 comments

Occasionally g10k hangs at control_repo branch iteration. There are 4 child processes that are just sleeping, which looks like a deadlock (checked with ps, pstree, lsof, iotop).

Platform: Linux amd64
Version: github release 0.9.9

$ g10k --version
g10k v0.9.9-1-gfc83c96 Build time: 2024-02-08_11:56:54 UTC

Config file g10k.yaml:

---
:cachedir: "./g10k-cache"
use_cache_fallback: true
retry_git_commands: true
ignore_unreachable_modules: false

git:
  default_ref: main

deploy:
  purge_levels: ["deployment", "environment", "puppetfile"]
  purge_allowlist: [ '/.gitkeep', '.latest_revision', '.resource_types', 'resource_types/*.pp', '/.pp' ]
  deployment_purge_allowlist: [ 'production_s' ]
  generate_types: false

sources:
  control_repo: &control_repo
    remote: "https://private_gitlab/control_repo.git"
    basedir: "./environments"
    filter_regex: "^(production|staging)$"
    force_forge_versions: false
    prefix: false
    exit_if_unreachable: true
    invalid_branches: "correct"
    warn_if_branch_is_missing: true

Executed like: g10k -verbose -debug -branch production -config ./g10k.yaml

Logs of problematic execution:

...
2025/01/27 14:02:03 Executing git --git-dir g10k-cache/environments/control_repo.git remote -v took 0.00079s
...
2025/01/27 14:02:04 Executing git --git-dir g10k-cache/environments/control_repo.git remote update --prune took 0.37934s
2025/01/27 14:02:04 DEBUG executeCommand(): Executing git --git-dir g10k-cache/environments/control_repo.git branch
2025/01/27 14:02:04 Executing git --git-dir g10k-cache/environments/control_repo.git branch took 0.00108s
2025/01/27 14:02:04 DEBUG Skipping branch main of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Environment staging of source control_repo does not match branch name filter 'production', skipping
2025/01/27 14:02:04 DEBUG Skipping branch test-branch1 of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Skipping branch test-branch2 of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Skipping branch testing of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG 1(): Resolving environment production of source control_repo
2025/01/27 14:02:04 DEBUG Skipping branch testing-branch3 of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Skipping branch testing-branch4 of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Skipping branch testing-branch5 of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Skipping branch testing-branch6 of source control_repo, because of filter_regex setting
2025/01/27 14:02:04 DEBUG Skipping branch testing_branch7 of source control_repo, because of filter_regex setting
[ hang indefinitely ]

Logs of correct execution:

...
2025/01/29 08:56:52 Executing git --git-dir g10k-cache/environments/control_repo.git remote -v took 0.00100s
...
2025/01/29 08:56:52 Executing git --git-dir g10k-cache/environments/control_repo.git remote update --prune took 0.26773s
2025/01/29 08:56:52 DEBUG executeCommand(): Executing git --git-dir g10k-cache/environments/control_repo.git branch
2025/01/29 08:56:52 Executing git --git-dir g10k-cache/environments/control_repo.git branch took 0.00146s
2025/01/29 08:56:52 DEBUG Skipping branch main of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Environment staging of source control_repo does not match branch name filter 'production', skipping
2025/01/29 08:56:52 DEBUG Skipping branch test-branch1 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Skipping branch test-branch2 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Skipping branch testing of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Skipping branch testing-branch3 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Skipping branch testing-branch4 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Skipping branch testing-branch5 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG 1(): Resolving environment production of source control_repo
2025/01/29 08:56:52 DEBUG Skipping branch testing-branch6 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG Skipping branch testing_branch7 of source control_repo, because of filter_regex setting
2025/01/29 08:56:52 DEBUG executeCommand(): Executing git --git-dir g10k-cache/environments/control_repo.git rev-parse --verify 'production^{object}'
2025/01/29 08:56:52 Executing git --git-dir g10k-cache/environments/control_repo.git rev-parse --verify 'production^{object}' took 0.00103s
...

So the last lines printed (in the deadlock case) are about the filtering of most branches ( https://github.com/xorpaul/g10k/blob/v0.9.9/puppetfile.go#L105 ) and, for branch "production", the start of syncing ( https://github.com/xorpaul/g10k/blob/v0.9.9/puppetfile.go#L122 ). At first glance, it seems that the random order of branch iteration is causing the deadlock. The issue was reproducible for one system user, while not for another...

I tried a custom build with more debug messages around the sizedwaitgroup calls and mutex locks/unlocks, built against the latest go version 1.23, but I couldn't reproduce it yet.
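
For reference, the concurrent branch resolution follows roughly this pattern (a minimal sketch of the fan-out, with illustrative branch names, not the actual g10k code):

package main

import (
	"fmt"

	"github.com/remeh/sizedwaitgroup"
)

func main() {
	branches := []string{"production", "staging", "testing"}
	// limit how many branches are resolved concurrently, like g10k's worker limit
	swg := sizedwaitgroup.New(4)

	for _, branch := range branches {
		swg.Add() // blocks while the limit of running workers is reached
		go func(branch string) {
			defer swg.Done()
			// ... resolve the environment: git rev-parse, checkout, Puppetfile ...
			fmt.Println("resolving", branch)
		}(branch)
	}

	// Wait() hangs forever if any worker never reaches Done(), e.g.
	// because it is stuck on a mutex that is never released
	swg.Wait()
}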

nickTh-tyn avatar Jan 29 '25 10:01 nickTh-tyn

Maybe the sizedwaitgroup library's PR https://github.com/remeh/sizedwaitgroup/pull/9 is related to this...?

nickTh-tyn avatar Jan 29 '25 10:01 nickTh-tyn

Hi! Luckily we hit the deadlock again. (This time all of these branches were enabled, but almost all environments were already up to date.) With SIGQUIT, we got the stack traces below:

SIGQUIT: quit
PC=0x403f2e m=0 sigcode=0

goroutine 0 [idle]:
runtime/internal/syscall.Syscall6()
	/usr/lib/go-1.21/src/runtime/internal/syscall/asm_linux_amd64.s:36 +0xe fp=0x7ffd228d52e0 sp=0x7ffd228d52d8 pc=0x403f2e
runtime/internal/syscall.EpollWait(0x0?, {0x7ffd228d53a8?, 0x0?, 0x0?}, 0x0?, 0x0?)
	/usr/lib/go-1.21/src/runtime/internal/syscall/syscall_linux.go:55 +0x45 fp=0x7ffd228d5330 sp=0x7ffd228d52e0 pc=0x403ea5
runtime.netpoll(0xc000026000?)
	/usr/lib/go-1.21/src/runtime/netpoll_epoll.go:118 +0xcb fp=0x7ffd228d59c0 sp=0x7ffd228d5330 pc=0x431aab
runtime.findRunnable()
	/usr/lib/go-1.21/src/runtime/proc.go:3186 +0x825 fp=0x7ffd228d5ad0 sp=0x7ffd228d59c0 pc=0x43d825
runtime.schedule()
	/usr/lib/go-1.21/src/runtime/proc.go:3577 +0xb1 fp=0x7ffd228d5b08 sp=0x7ffd228d5ad0 pc=0x43e991
runtime.park_m(0xc000007ba0?)
	/usr/lib/go-1.21/src/runtime/proc.go:3740 +0x11f fp=0x7ffd228d5b50 sp=0x7ffd228d5b08 pc=0x43ee9f
runtime.mcall()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:458 +0x4e fp=0x7ffd228d5b68 sp=0x7ffd228d5b50 pc=0x464fee

goroutine 1 [semacquire]:
runtime.gopark(0xc00001c2a0?, 0xc00001c2a0?, 0x60?, 0x40?, 0xc00018c750?)
	/usr/lib/go-1.21/src/runtime/proc.go:398 +0xce fp=0xc000145450 sp=0xc000145430 pc=0x4385ce
runtime.goparkunlock(...)
	/usr/lib/go-1.21/src/runtime/proc.go:404
runtime.semacquire1(0xc00005d6d8, 0x30?, 0x1, 0x0, 0xcb?)
	/usr/lib/go-1.21/src/runtime/sema.go:160 +0x218 fp=0xc0001454b8 sp=0xc000145450 pc=0x448f78
sync.runtime_Semacquire(0xc00018c780?)
	/usr/lib/go-1.21/src/runtime/sema.go:62 +0x25 fp=0xc0001454f0 sp=0xc0001454b8 pc=0x463625
sync.(*WaitGroup).Wait(0xc0001455a0?)
	/usr/lib/go-1.21/src/sync/waitgroup.go:116 +0x48 fp=0xc000145518 sp=0xc0001454f0 pc=0x471068
github.com/remeh/sizedwaitgroup.(*SizedWaitGroup).Wait(...)
	/home/andpaul/dev/go/src/github.com/xorpaul/g10k/vendor/github.com/remeh/sizedwaitgroup/sizedwaitgroup.go:83
main.resolvePuppetEnvironment(0x0, {0x0, 0x0})
	/home/andpaul/dev/go/src/github.com/xorpaul/g10k/puppetfile.go:214 +0x37b fp=0xc000145698 sp=0xc000145518 pc=0x6ca63b
main.main()
	/home/andpaul/dev/go/src/github.com/xorpaul/g10k/g10k.go:291 +0xb87 fp=0xc000145f40 sp=0xc000145698 pc=0x6c2507
runtime.main()
	/usr/lib/go-1.21/src/runtime/proc.go:267 +0x2bb fp=0xc000145fe0 sp=0xc000145f40 pc=0x43817b
runtime.goexit()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000145fe8 sp=0xc000145fe0 pc=0x466e61

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/lib/go-1.21/src/runtime/proc.go:398 +0xce fp=0xc00003afa8 sp=0xc00003af88 pc=0x4385ce
runtime.goparkunlock(...)
	/usr/lib/go-1.21/src/runtime/proc.go:404
runtime.forcegchelper()
	/usr/lib/go-1.21/src/runtime/proc.go:322 +0xb3 fp=0xc00003afe0 sp=0xc00003afa8 pc=0x438453
runtime.goexit()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00003afe8 sp=0xc00003afe0 pc=0x466e61
created by runtime.init.6 in goroutine 1
	/usr/lib/go-1.21/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/lib/go-1.21/src/runtime/proc.go:398 +0xce fp=0xc00003b778 sp=0xc00003b758 pc=0x4385ce
runtime.goparkunlock(...)
	/usr/lib/go-1.21/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
	/usr/lib/go-1.21/src/runtime/mgcsweep.go:280 +0x94 fp=0xc00003b7c8 sp=0xc00003b778 pc=0x424754
runtime.gcenable.func1()
	/usr/lib/go-1.21/src/runtime/mgc.go:200 +0x25 fp=0xc00003b7e0 sp=0xc00003b7c8 pc=0x419b05
runtime.goexit()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00003b7e8 sp=0xc00003b7e0 pc=0x466e61
created by runtime.gcenable in goroutine 1
	/usr/lib/go-1.21/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc00005a000?, 0x7dec98?, 0x1?, 0x0?, 0xc000006d00?)
	/usr/lib/go-1.21/src/runtime/proc.go:398 +0xce fp=0xc00003bf70 sp=0xc00003bf50 pc=0x4385ce
runtime.goparkunlock(...)
	/usr/lib/go-1.21/src/runtime/proc.go:404
runtime.(*scavengerState).park(0x9cb4a0)
	/usr/lib/go-1.21/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00003bfa0 sp=0xc00003bf70 pc=0x422029
runtime.bgscavenge(0x0?)
	/usr/lib/go-1.21/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc00003bfc8 sp=0xc00003bfa0 pc=0x4225bc
runtime.gcenable.func2()
	/usr/lib/go-1.21/src/runtime/mgc.go:201 +0x25 fp=0xc00003bfe0 sp=0xc00003bfc8 pc=0x419aa5
runtime.goexit()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00003bfe8 sp=0xc00003bfe0 pc=0x466e61
created by runtime.gcenable in goroutine 1
	/usr/lib/go-1.21/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait]:
runtime.gopark(0x40c9fe?, 0x7f882d40af20?, 0x0?, 0x0?, 0x0?)
	/usr/lib/go-1.21/src/runtime/proc.go:398 +0xce fp=0xc00003a620 sp=0xc00003a600 pc=0x4385ce
runtime.runfinq()
	/usr/lib/go-1.21/src/runtime/mfinal.go:193 +0x107 fp=0xc00003a7e0 sp=0xc00003a620 pc=0x418b27
runtime.goexit()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00003a7e8 sp=0xc00003a7e0 pc=0x466e61
created by runtime.createfing in goroutine 1
	/usr/lib/go-1.21/src/runtime/mfinal.go:163 +0x3d

goroutine 17 [sync.Mutex.Lock]:
runtime.gopark(0x7f882d4083d0?, 0x7f882d3ff108?, 0x0?, 0x4c?, 0x7f882d409438?)
	/usr/lib/go-1.21/src/runtime/proc.go:398 +0xce fp=0xc000074930 sp=0xc000074910 pc=0x4385ce
runtime.goparkunlock(...)
	/usr/lib/go-1.21/src/runtime/proc.go:404
runtime.semacquire1(0x9fa584, 0x0?, 0x3, 0x1, 0xea?)
	/usr/lib/go-1.21/src/runtime/sema.go:160 +0x218 fp=0xc000074998 sp=0xc000074930 pc=0x448f78
sync.runtime_SemacquireMutex(0x8e?, 0x3b?, 0x3b?)
	/usr/lib/go-1.21/src/runtime/sema.go:77 +0x25 fp=0xc0000749d0 sp=0xc000074998 pc=0x4636e5
sync.(*Mutex).lockSlow(0x9fa580)
	/usr/lib/go-1.21/src/sync/mutex.go:171 +0x15d fp=0xc000074a20 sp=0xc0000749d0 pc=0x46f9bd
sync.(*Mutex).Lock(...)
	/usr/lib/go-1.21/src/sync/mutex.go:90
main.resolvePuppetEnvironment.func1.1({0xc00020c02c, 0x12}, {{0xc00001a0c0, 0x3b}, {0xc000012cc0, 0xe}, {0xc000012cf0, 0x5}, {0x0, 0x0}, ...}, ...)
	/home/andpaul/dev/go/src/github.com/xorpaul/g10k/puppetfile.go:152 +0x832 fp=0xc000074f28 sp=0xc000074a20 pc=0x6cc312
main.resolvePuppetEnvironment.func1.3()
	/home/andpaul/dev/go/src/github.com/xorpaul/g10k/puppetfile.go:197 +0x66 fp=0xc000074fe0 sp=0xc000074f28 pc=0x6cbaa6
runtime.goexit()
	/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x466e61
created by main.resolvePuppetEnvironment.func1 in goroutine 6
	/home/andpaul/dev/go/src/github.com/xorpaul/g10k/puppetfile.go:119 +0xddc

rax    0xfffffffffffffffc
rbx    0x4
rcx    0x403f2e
rdx    0x80
rdi    0x4
rsi    0x7ffd228d53a8
rbp    0x7ffd228d5320
rsp    0x7ffd228d52d8
r8     0x0
r9     0x0
r10    0xffffffffffffffff
r11    0x246
r12    0x7ffd228d53b0
r13    0x7947942882882
r14    0x9cb880
r15    0x1f
rip    0x403f2e
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
ERROR: Exiting.

nickTh-tyn avatar Feb 07 '25 15:02 nickTh-tyn

We've never encountered this before.

How many branches are you updating? Did you change the default count of any workers? -maxworker or -maxextractworker?

But yes, given your stacktrace it could be caused by either remeh/sizedwaitgroup or the mutex.Lock() in puppetfile.go

I still need to investigate this further, but right now I can't even get my build to work anymore as it does not terminate even with the -version parameter πŸ₯΄

$ export CGO_ENABLED=0 ; export BUILDTIME=$(date -u '+%Y-%m-%d_%H:%M:%S') ; export BUILDVERSION=$(git describe --tags) ; /usr/local/go/bin/go build -ldflags "-X main.buildtime=${BUILDTIME} -X main.buildversion=${BUILDVERSION}" && date && ./g10k -version
Fri Feb  7 05:27:07 PM CET 2025
g10k  v0.9.9-9-g71ddd11  Build time: 2025-02-07_16:27:07 UTC

xorpaul avatar Feb 07 '25 16:02 xorpaul

We're seeing hangs in Debian as well. I can't reproduce them in the Docker container (which is based on alpine and golang 1.17). Debian's package is built with golang 1.23, so I wonder if that might be the issue here. But I can't reproduce by patching that Dockerfile to switch from 1.17 to 1.23...

So there might be something wrong with our build toolchain, but i figured i would mention it here since others are having similar issues...

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1093015

anarcat avatar Feb 09 '25 21:02 anarcat

and i just tried to reproduce in a debian container and I can't reproduce.

so there's something on my desktop, or environment or something that triggers this.

i'll also note that the package from debian bookworm (0.9.7) doesn't have that issue.

anarcat avatar Feb 09 '25 21:02 anarcat

Hi! @xorpaul no changes to max workers or similar settings. Just to be clear, the issue is not always reproducible. For example, on the same host there is virtually no problem for one user, while for another it hangs almost always.

I'm building and testing after replacing sync.Mutex with delock.Mutex (built against golang 1.23). Hopefully I'll come back with more info.

nickTh-tyn avatar Feb 10 '25 14:02 nickTh-tyn

I'm building and testing after replacing sync.Mutex with delock.Mutex (built against golang 1.23). Hopefully I'll come back with more info.

do you have an actual patch showing what you did there? i'm curious..

anarcat avatar Feb 10 '25 19:02 anarcat

Hi @anarcat ! Here are the changes: https://github.com/xorpaul/g10k/commit/783e104e884b71288a05324eaa8fd8c6349c3385 (note that I kept the same code lines, so any debugging info applies to the original code as well)

Build by:

go get
go mod vendor
export CGO_ENABLED=0
export BUILDTIME=$(date -u '+%Y-%m-%d_%H:%M:%S')
export BUILDVERSION=$(git describe --tags)
go build -ldflags "-X main.buildtime=${BUILDTIME} -X main.buildversion=${BUILDVERSION}"
./g10k -version
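
On the delock swap above: I haven't tried delock myself, but for anyone wanting the same effect with a different tool, the widely used sasha-s/go-deadlock package is a drop-in replacement for sync that reports locks that wait too long instead of hanging silently. A sketch of the idea (not the patch above):

package main

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

// same idea as the patch: swap sync.Mutex for a detecting mutex
var mutex deadlock.Mutex

func main() {
	// report (with stack traces) any Lock() that waits longer than 30s
	deadlock.Opts.DeadlockTimeout = 30 * time.Second

	mutex.Lock()
	// ... critical section ...
	mutex.Unlock()
}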

nickTh-tyn avatar Feb 11 '25 09:02 nickTh-tyn

i see this hang with that patch as well.

note that i don't even need to have a Puppetfile or any special environment: just running g10k -version in an empty temp directory triggers the issue.

what i find really puzzling is that i can't reproduce with the exact same package, in a clean Debian container. this means it's something with the glibc runtime or something, because it's not the environment either, env -i still triggers it.

what a weird bug.

here's my backtrace:

anarcat@angela:/t/tmp.9qpX0l7YLs> g10k -version
g10k  0.9.9  Build time: 1739301748 UTC
^\SIGQUIT: quit
PC=0x4748a1 m=0 sigcode=128

goroutine 0 gp=0xa59600 m=0 mp=0xa5a620 [idle]:
runtime.futex(0xa5a760, 0x80, 0x0, 0x0, 0x0, 0x0)
	runtime/sys_linux_amd64.s:557 +0x21 fp=0x7fff6bf7a858 sp=0x7fff6bf7a850 pc=0x4748a1
runtime.futexsleep(0x7fff6bf7a8d0?, 0x40c030?, 0x47448d?)
	runtime/os_linux.go:69 +0x30 fp=0x7fff6bf7a8a8 sp=0x7fff6bf7a858 pc=0x431290
runtime.notesleep(0xa5a760)
	runtime/lock_futex.go:170 +0x87 fp=0x7fff6bf7a8e0 sp=0x7fff6bf7a8a8 pc=0x40c1c7
runtime.mPark(...)
	runtime/proc.go:1866
runtime.stopm()
	runtime/proc.go:2886 +0x8c fp=0x7fff6bf7a910 sp=0x7fff6bf7a8e0 pc=0x43c5cc
runtime.findRunnable()
	runtime/proc.go:3623 +0xd5c fp=0x7fff6bf7aa88 sp=0x7fff6bf7a910 pc=0x43e03c
runtime.schedule()
	runtime/proc.go:3996 +0xb1 fp=0x7fff6bf7aac0 sp=0x7fff6bf7aa88 pc=0x43f111
runtime.goschedImpl(0xc0000061c0, 0x80?)
	runtime/proc.go:4137 +0x185 fp=0x7fff6bf7ab10 sp=0x7fff6bf7aac0 pc=0x43f6e5
runtime.gosched_m(0xc0000061c0?)
	runtime/proc.go:4142 +0x15 fp=0x7fff6bf7ab30 sp=0x7fff6bf7ab10 pc=0x43f855
runtime.mcall()
	runtime/asm_amd64.s:459 +0x4e fp=0x7fff6bf7ab48 sp=0x7fff6bf7ab30 pc=0x470c2e

goroutine 1 gp=0xc0000061c0 m=nil [runnable]:
runtime.Gosched()
	runtime/proc.go:353 +0x14 fp=0xc0000dd628 sp=0xc0000dd610 pc=0x437d14
internal/runtime/exithook.Run(0x0)
	internal/runtime/exithook/hooks.go:61 +0x36 fp=0xc0000dd688 sp=0xc0000dd628 pc=0x477076
runtime.runExitHooks(...)
	runtime/proc.go:320
os.runtime_beforeExit(0xc0000a0410?)
	runtime/proc.go:307 +0x14 fp=0xc0000dd6a0 sp=0xc0000dd688 pc=0x46af34
os.Exit(0x0)
	os/proc.go:75 +0x2e fp=0xc0000dd6c0 sp=0xc0000dd6a0 pc=0x4bff4e
main.main()
	github.com/xorpaul/g10k/g10k.go:259 +0x91e fp=0xc0000ddf50 sp=0xc0000dd6c0 pc=0x709d9e
runtime.main()
	runtime/proc.go:272 +0x28b fp=0xc0000ddfe0 sp=0xc0000ddf50 pc=0x43794b
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000ddfe8 sp=0xc0000ddfe0 pc=0x472ae1

goroutine 2 gp=0xc000006700 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000064fa8 sp=0xc000064f88 pc=0x46b02e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.forcegchelper()
	runtime/proc.go:337 +0xb3 fp=0xc000064fe0 sp=0xc000064fa8 pc=0x437c93
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000064fe8 sp=0xc000064fe0 pc=0x472ae1
created by runtime.init.7 in goroutine 1
	runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000006c40 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000065780 sp=0xc000065760 pc=0x46b02e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.bgsweep(0xc00001e180)
	runtime/mgcsweep.go:277 +0x94 fp=0xc0000657c8 sp=0xc000065780 pc=0x422734
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000657e0 sp=0xc0000657c8 pc=0x417045
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000657e8 sp=0xc0000657e0 pc=0x472ae1
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000006e00 m=nil [GC scavenge wait]:
runtime.gopark(0xc00001e180?, 0x83f970?, 0x1?, 0x0?, 0xc000006e00?)
	runtime/proc.go:424 +0xce fp=0xc000065f78 sp=0xc000065f58 pc=0x46b02e
runtime.goparkunlock(...)
	runtime/proc.go:430
runtime.(*scavengerState).park(0xa59200)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000065fa8 sp=0xc000065f78 pc=0x420169
runtime.bgscavenge(0xc00001e180)
	runtime/mgcscavenge.go:653 +0x3c fp=0xc000065fc8 sp=0xc000065fa8 pc=0x4206dc
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000065fe0 sp=0xc000065fc8 pc=0x416fe5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000065fe8 sp=0xc000065fe0 pc=0x472ae1
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 5 gp=0xc000007340 m=nil [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xc0000646d8?)
	runtime/proc.go:424 +0xce fp=0xc000064620 sp=0xc000064600 pc=0x46b02e
runtime.runfinq()
	runtime/mfinal.go:193 +0x107 fp=0xc0000647e0 sp=0xc000064620 pc=0x4160c7
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000647e8 sp=0xc0000647e0 pc=0x472ae1
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:163 +0x3d

goroutine 6 gp=0xc0000076c0 m=nil [chan receive]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:424 +0xce fp=0xc000066718 sp=0xc0000666f8 pc=0x46b02e
runtime.chanrecv(0xc000022380, 0x0, 0x1)
	runtime/chan.go:639 +0x41c fp=0xc000066790 sp=0xc000066718 pc=0x406c9c
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:489 +0x12 fp=0xc0000667b8 sp=0xc000066790 pc=0x406852
runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
	runtime/mgc.go:1781
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1784 +0x2f fp=0xc0000667e0 sp=0xc0000667b8 pc=0x419e6f
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000667e8 sp=0xc0000667e0 pc=0x472ae1
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1779 +0x96

rax    0xca
rbx    0x0
rcx    0x4748a3
rdx    0x0
rdi    0xa5a760
rsi    0x80
rbp    0x7fff6bf7a898
rsp    0x7fff6bf7a850
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x7fff6bf7a8d0
r13    0x1
r14    0xa59600
r15    0xf
rip    0x4748a1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

what's puzzling there is that the backtrace has practically nothing inside g10k itself, it's all in stdlib. looking at main() there's nothing but flag parsing there... what could possibly be triggering this other than a fault inside the golang runtime itself?

anarcat avatar Feb 11 '25 20:02 anarcat

Yeah, I'm also confused by this. I was able to compile and run it at the end of 2024, but now I can't.

Maybe downgrading the golang version will help?

xorpaul avatar Feb 11 '25 20:02 xorpaul

well, it's interesting. the bug doesn't trigger if you remove everything up to the flag parser and all files but config.go, g10k.go and helper.go.

i'll keep bisecting like this if i can figure out which source file is triggering the deadlock.

update: that's about as far as I can go... all other files are too coupled with each other for this to build at all.

anarcat avatar Feb 11 '25 20:02 anarcat

for what it's worth, i can't reproduce this by building g10k with golang 1.19 on debian trixie. so it definitely sounds like a regression somewhere in golang, between 1.19 and 1.23.

i'd recommend hooking this up to CI here in a matrix that would test those different versions.
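
something like this, for example (a sketch, not a tested workflow; names and versions are illustrative):

name: build-matrix
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go: ['1.19', '1.21', '1.23']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go }}
      - run: go build
      # guard against this bug: fail if -version does not exit promptly
      - run: timeout 30 ./g10k -version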

anarcat avatar Feb 11 '25 21:02 anarcat

This looks like a different problem (but you never know...), like a waiting channel that blocks the cleanup of a top-level structure. (Googling unique_runtime_registerUniqueMapCleanup and chanrecv is not of great help...)

I could not reproduce anarcat's issue building either with CGO_ENABLED=1 or CGO_ENABLED=0 (static binary). Tried:

  • Ubuntu 20.04, go 1.23.5 (manual install)
  • Debian 12, go 1.19.8 ( golang deb package 2:1.19~1 )
  • Docker image of Debian Trixie, go 1.23.5 ( golang deb package 2:1.23~2 )

I also tried the latest github release's binary (go 1.21) inside a trixie container, like:

make clean
cp /path/to/official/release/g10k ./

cat > Dockerfile.trixie <<'EOF'
FROM debian:trixie-slim
COPY ./g10k /usr/bin/
ENTRYPOINT [ "/usr/bin/g10k" ]
CMD ["-version"]
EOF

podman build -t g10k:test -f Dockerfile.trixie .
podman run --rm g10k:test

@anarcat please share more info on your environment (go env) and maybe the output of:

strace ./g10k -version
ps aux | grep [g]10k
pstree -lpnghs $G10K_PID
lsof -p $G10K_PID

etc, etc

Is your issue 100% reproducible, or does it look a bit random?

nickTh-tyn avatar Feb 12 '25 10:02 nickTh-tyn

strace ./g10k -version

that thing is massive, mainly a lot of futex failures

g10k-strace.txt

ps aux | grep [g]10k

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
anarcat   255071  191  0.0 1749112 8432 pts/4    Sl+  08:57   0:26 g10k -version

pstree -lpnghs $G10K_PID

anarcat@angela:~> pstree -lpnghs 255071                                                                                                                   (main)
systemd(1,1)───systemd(3829,3829)───sh(237676,237675)───foot(237677,237675)───fish(237681,237681)───g10k(255071,255071)─┬─{g10k}(255072,255071)
                                                                                                                        β”œβ”€{g10k}(255073,255071)
                                                                                                                        β”œβ”€{g10k}(255074,255071)
                                                                                                                        β”œβ”€{g10k}(255075,255071)
                                                                                                                        β”œβ”€{g10k}(255076,255071)
                                                                                                                        β”œβ”€{g10k}(255077,255071)
                                                                                                                        └─{g10k}(255090,255071)

lsof -p $G10K_PID

anarcat@angela:~> lsof -p 255071                                                                                                                          (main)
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
COMMAND    PID    USER  FD      TYPE DEVICE SIZE/OFF     NODE NAME
g10k    255071 anarcat cwd       DIR  253,3   147456 26214652 /home/anarcat/dist
g10k    255071 anarcat rtd       DIR  253,1     4096        2 /
g10k    255071 anarcat txt       REG  253,1  6690752   266450 /usr/bin/g10k
g10k    255071 anarcat mem       REG  253,1  2003408   310441 /usr/lib/x86_64-linux-gnu/libc.so.6
g10k    255071 anarcat mem       REG  253,1   222968   307843 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
g10k    255071 anarcat   0u      CHR  136,4      0t0        7 /dev/pts/4
g10k    255071 anarcat   1u      CHR  136,4      0t0        7 /dev/pts/4
g10k    255071 anarcat   2u      CHR  136,4      0t0        7 /dev/pts/4
g10k    255071 anarcat   3w      CHR    5,0      0t0       11 /dev/tty
g10k    255071 anarcat   4u  a_inode   0,16        0     2072 [eventpoll:3,5,6]
g10k    255071 anarcat   5u  a_inode   0,16        0     2072 [eventfd:104]
g10k    255071 anarcat   6w      CHR    5,0      0t0       11 /dev/tty

etc, etc

what else?

Is your issue 100% reproducible, or does it look a bit random?

it's 100% reproducible on this desktop. as i said earlier, i can't reproduce inside a debian container, so there's really something off with this issue.

it might not be the same as the issue described here. i just hooked up here because it looked similar.

anarcat avatar Feb 12 '25 14:02 anarcat

oh, and here's a "normal" strace run in a debian unstable container, where the issue is not reproduced.

g10k-strace-normal.txt

anarcat avatar Feb 12 '25 14:02 anarcat

oh wow.

it's a terminal issue.

if i run g10k -version inside gnome-terminal, the bug doesn't trigger. but if i run g10k inside foot or alacritty, then the process hangs.

i got the hint when i found this in the strace:

254913 openat(AT_FDCWD, "/dev/tty", O_WRONLY|O_CLOEXEC <unfinished ...>

that should normally be:

[pid  3493] openat(AT_FDCWD, "/dev/tty", O_WRONLY|O_CLOEXEC) = 6

this is presumably why it can't be reproduced inside podman, because it messes with the terminal.
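
a minimal Go program to test that theory outside g10k might look like this (an assumption about the trigger, i'm not sure it actually reproduces the hang):

package main

import (
	"fmt"
	"os"
)

func main() {
	// open the controlling terminal write-only, like the strace shows g10k
	// doing; the Go runtime may set the fd non-blocking and register it
	// with epoll, which is where the traces differ between terminals
	tty, err := os.OpenFile("/dev/tty", os.O_WRONLY, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, "no tty:", err)
		os.Exit(1)
	}
	fmt.Fprintln(tty, "tty opened, exiting")
	os.Exit(0) // if the terminal is the trigger, the hang would happen here
}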

anarcat avatar Feb 12 '25 14:02 anarcat

Hi! Thanks for the straces, I'll take a look. I think the slightly different runtime environment (even different terminals) affects the race conditions and the appearance of the bug. I added a few more commits to my fork, also embedding pprof, which should show something when g10k is hanging, by browsing to http://localhost:6060. You can even try the build in a debian sid container: https://github.com/nickTh-tyn/g10k/blob/master/Dockerfile.sid-build
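
For reference, embedding pprof that way only needs a blank import and a listener. A sketch of the approach (not the exact commits in my fork):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"time"
)

func main() {
	go func() {
		// while the process hangs, http://localhost:6060/debug/pprof/goroutine?debug=2
		// dumps all goroutine stacks without needing SIGQUIT
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the program; sleep stands in for real work here
	time.Sleep(time.Hour)
}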

My latest suspicion was the promoted mutex of the LatestForgeModules type ( https://github.com/xorpaul/g10k/blob/v0.9.9/g10k.go#L67 ), so I made it a regular struct field ( https://github.com/xorpaul/g10k/commit/4f36baa756f40551b8ce22473a8b4ee04e9bfd34 )...
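
In other words, the commit turns the promoted (embedded) mutex into a named field, roughly like this (a sketch of the idea; the exact diff is in the commit above):

package main

import "sync"

// before: the embedded mutex is promoted, so a value of this type
// exposes Lock()/Unlock() directly
type LatestForgeModules struct {
	sync.Mutex
	m map[string]string
}

// after: a plain named field, locked explicitly through the field
type latestForgeModules struct {
	mutex sync.Mutex
	m     map[string]string
}

func main() {
	var a LatestForgeModules
	a.Lock() // promoted method
	a.Unlock()

	var b latestForgeModules
	b.mutex.Lock() // explicit field
	b.mutex.Unlock()
}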

nickTh-tyn avatar Feb 12 '25 14:02 nickTh-tyn

@nickTh-tyn did you see the part where i can't reproduce inside gnome-terminal but i can under foot and alacritty? have you tried reproducing that?

anarcat avatar Feb 12 '25 15:02 anarcat

@anarcat you are right, it's the terminal. It was also hanging on my workstation when running under xterm (gnome-shell and terminator had no issue). High CPU usage, and pprof shows that time is mostly spent in lock/unlock.

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=2
...
(pprof) top10
Showing nodes accounting for 1940ms, 75.19% of 2580ms total
Dropped 10 nodes (cum <= 12.90ms)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
     360ms 13.95% 13.95%      430ms 16.67%  runtime.lock2
     340ms 13.18% 27.13%      480ms 18.60%  runtime.casgstatus
     290ms 11.24% 38.37%      350ms 13.57%  runtime.unlock2
     220ms  8.53% 46.90%      220ms  8.53%  runtime.nanotime (inline)
     180ms  6.98% 53.88%      180ms  6.98%  runtime.futex
     170ms  6.59% 60.47%      170ms  6.59%  internal/runtime/atomic.(*Int32).CompareAndSwap
     120ms  4.65% 65.12%     1530ms 59.30%  runtime.schedule
     110ms  4.26% 69.38%      120ms  4.65%  runtime.globrunqget
      90ms  3.49% 72.87%      980ms 37.98%  runtime.findRunnable
      60ms  2.33% 75.19%       60ms  2.33%  runtime.(*mLockProfile).recordUnlock

But the github release (go 1.21) has no issue (at least on my workstation).

nickTh-tyn avatar Feb 12 '25 15:02 nickTh-tyn

that is the weirdest bug this month.

anarcat avatar Feb 12 '25 15:02 anarcat

For me, under xterm, the /dev/tty openat syscalls are successful and I see only futex failures:

...
futex(0xc00004e848, FUTEX_WAKE_PRIVATE, 1) = 1
fcntl(1, F_GETFL)                       = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fcntl(2, F_GETFL)                       = 0x8001 (flags O_WRONLY|O_LARGEFILE)
openat(AT_FDCWD, "/dev/tty", O_WRONLY|O_CLOEXEC) = 3
fcntl(3, F_GETFL)                       = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fcntl(3, F_SETFL, O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 5
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=10983520, u64=10983520}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3456106497, u64=9221515660956270593}}) = 0
fcntl(3, F_GETFL)                       = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl(3, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
ioctl(3, TIOCGWINSZ, {ws_row=64, ws_col=245, ws_xpixel=1474, ws_ypixel=836}) = 0
openat(AT_FDCWD, "/dev/tty", O_WRONLY|O_CLOEXEC) = 6
fcntl(6, F_GETFL)                       = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3437756417, u64=9221515660937920513}}) = 0
fcntl(6, F_GETFL)                       = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl(6, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
ioctl(6, TIOCGWINSZ, {ws_row=64, ws_col=245, ws_xpixel=1474, ws_ypixel=836}) = 0
ioctl(1, TCGETS, 0xc0000a4dac)          = -1 ENOTTY (Inappropriate ioctl for device)
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
write(1, "g10k v0.9.9 Build time: 2025-0"..., 50g10k v0.9.9 Build time: 2025-02-12_15:37:29 UTC
) = 50
futex(0xc00004e848, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000100148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc00004e848, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xa59a60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
...

nickTh-tyn avatar Feb 12 '25 16:02 nickTh-tyn

BTW I tried to refresh all libraries (go list -mod=mod -u -m all; go get -u; go mod vendor; go clean -cache; make clean; ...), as I suspected outdated dependencies, like golang.org/x/term or github.com/mattn/go-isatty, but the hanging at app exit under xterm was not fixed...

nickTh-tyn avatar Feb 13 '25 08:02 nickTh-tyn

And I confirm that, even with updated dependencies, the bookworm static binary works fine (go 1.19.8), while trixie/sid (go 1.23.6) is problematic (I pushed sample Dockerfiles for building), in relation to the exit issue.

nickTh-tyn avatar Feb 13 '25 10:02 nickTh-tyn

@xorpaul @anarcat I would propose updating all dependencies to the latest versions, tagging and posting a new github release (with go 1.19 - 1.21, which was tested), and in parallel opening a new issue about the hang at exit (g10k -version) for visibility. That would make it easier to link to possible new issues, with new go versions, in the dependencies themselves.

nickTh-tyn avatar Feb 13 '25 10:02 nickTh-tyn

I can now finally compile and execute the binary again without any problems, no matter the terminal I use. For me everything was fixed with the latest golang 1.24 release, which was only released two days ago (2025-02-11): https://go.dev/doc/devel/release

$ /tmp/go/bin/go version
go version go1.23.6 linux/amd64
$ export CGO_ENABLED=0 ; /tmp/go/bin/go build -ldflags "-X main.buildtime=${BUILDTIME} -X main.buildversion=${BUILDVERSION}" && date && ./g10k --version
Thu Feb 13 01:52:43 PM CET 2025
g10k    Build time:  UTC
^C hangs indefinitely
$ go version
go version go1.24.0 linux/amd64
$ export CGO_ENABLED=0 ; go build -ldflags "-X main.buildtime=${BUILDTIME} -X main.buildversion=${BUILDVERSION}" && date && ./g10k --version
Thu Feb 13 01:52:57 PM CET 2025
g10k    Build time:  UTC
$ returns immediately as it should

Can you two also try it with the newest golang version?

xorpaul avatar Feb 13 '25 12:02 xorpaul

Hi! It works with go 1.24.0.

nickTh-tyn avatar Feb 13 '25 13:02 nickTh-tyn

Ok, I've fixed all the github CI tests, updated the vendor/modules/dependencies, and built a new release using go 1.24.0: https://github.com/xorpaul/g10k/releases/tag/v0.9.10

@nickTh-tyn Hopefully this also helps with your stuck g10k runs

xorpaul avatar Feb 14 '25 15:02 xorpaul

that is great news, but i'm sorry to announce i can't reproduce the fix in debian: building against 1.24 doesn't fix the problem, so perhaps there's some other dependency i need to update here as well.

sigh.

anarcat avatar Feb 14 '25 15:02 anarcat

Thanks a lot @xorpaul ! There's a good chance it's the same issue, as my colleague who was facing the problem had switched to konsole recently... We'll start using your latest release to check.

nickTh-tyn avatar Feb 14 '25 15:02 nickTh-tyn

@anarcat Did you try to build it against the current master branch? I've also updated all dependencies/modules.

xorpaul avatar Feb 17 '25 13:02 xorpaul