SIGSEGV or SIGBUS when validating
I've got a strange issue when trying to validate a large number (more than 100) of large XML files (20-130 MB). It looks like this:
unexpected fault address 0xc0068c3000
fatal error: fault
[signal SIGBUS: bus error code=0x4 addr=0xc0068c3000 pc=0x55dffd]
or
SIGSEGV: segmentation violation
PC=0x7f19813d22222e3 m=43 sigcode=1
signal arrive during cgo execution
The stack trace is always the same:
runtime.cgocall(0x868de0, 0xc000739698)
/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000739670 sp=0xc000739638 pc=0x40a68b
github.com/terminalstatic/go-xsd-validate._C2func_cParseDoc(0x7fe986193010, 0x5e6c74f, 0x1)
_cgo_gotypes.go:254 +0x57 fp=0xc000739698 sp=0xc000739670 pc=0x538d57
github.com/terminalstatic/go-xsd-validate.parseXmlMem.func3(0x7fe986193010?, {0xc020aa0000?, 0x5e6c74f, 0x757e000?}, 0x1?)
/builds/app/.go/pkg/mod/github.com/terminalstatic/[email protected]/libxml2.go:433 +0x5a fp=0xc0007396e8 sp=0xc000739698 pc=0x5397fa
github.com/terminalstatic/go-xsd-validate.parseXmlMem({0xc020aa0000, 0x5e6c74f, 0x757e000}, 0xfe?)
/builds/app/.go/pkg/mod/github.com/terminalstatic/[email protected]/libxml2.go:433 +0xa5 fp=0xc000739780 sp=0xc0007396e8 pc=0x5395e5
github.com/terminalstatic/go-xsd-validate.NewXmlHandlerMem({0xc020aa0000?, 0xc00f542010?, 0x0?}, 0x1d?)
/builds/app/.go/pkg/mod/github.com/terminalstatic/[email protected]/validate_xsd.go:94 +0x29 fp=0xc0007397d8 sp=0xc000739780 pc=0x53a3e9
The problem is that I can't reproduce it on my machine with the same files, and I don't have access to the server where the error occurs.
The error can happen at any time, on any file. I can't reproduce it with one specific file or set of files, only with a batch of XML files. It can happen on the second file or on the 29th; there is no pattern.
How can I debug or reproduce the error? Maybe there is a bug in the C code?
I've been away from Go and C for quite some time, so YMMV. If I understand it right, it only happens on a particular machine? Does the architecture of that machine differ from your dev machine? What OS are that machine and your dev machine running? Do you cross-compile? Which version of Go are you using? Did you try a different (possibly older) one?
I think it depends on the data rather than the hardware, because we have tested several setups (go1.21 everywhere):
- my local macOS (happened once)
- dev servers, centos (never happened)
- production servers, centos
We do not cross-compile; the binary is built on the same OS where it runs. And we can't try older versions, because we need the updates from #12.
Ah, I forgot to mention that we never run production data on local or dev machines.
Understandable but really hard to debug then. Maybe far fetched but when you handle heaps of data concurrently (do you?), did you play around with go's and memory settings?
Yeah, we process XML files concurrently in workers. Could that affect it?
play around with go's and memory settings?
No, default everywhere.
Just to make sure: you are freeing correctly and calling Init and Cleanup only once?
I should. Init on startup and then in each worker I handle files, validation and cleanup.
Cleanup or Free? Cleanup should only be called when program or part of program exits ...
Cleanup on program exit and Free on workers' job done.
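For anyone reading along, the split described here (Init once at startup, Free after each job, Cleanup once at exit) can be sketched roughly as below. The `lib` and `handle` types are hypothetical stand-ins that only count calls; they are not the go-xsd-validate API, and the actual validation step is left as a comment.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lib is a hypothetical stand-in for library-wide state that must be
// initialised once and cleaned up once. It only counts calls.
type lib struct {
	inits, cleanups atomic.Int32
	opened          atomic.Int32 // handles ever opened
	live            atomic.Int32 // handles opened but not yet freed
}

func (l *lib) Init()    { l.inits.Add(1) }
func (l *lib) Cleanup() { l.cleanups.Add(1) }

// handle is a hypothetical stand-in for a per-document resource.
type handle struct{ l *lib }

func (l *lib) Open(name string) *handle {
	l.opened.Add(1)
	l.live.Add(1)
	return &handle{l: l}
}

func (h *handle) Free() { h.l.live.Add(-1) }

// process fans files out to a fixed number of workers. Each job frees its
// own handle; nothing in here touches Init or Cleanup.
func process(l *lib, files []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				func() {
					h := l.Open(name)
					defer h.Free() // per-job release, even on early return
					// validation would go here
				}()
			}
		}()
	}
	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
}

func main() {
	l := &lib{}
	l.Init()          // once, at startup
	defer l.Cleanup() // once, at program exit
	process(l, []string{"a.xml", "b.xml", "c.xml"}, 2)
	fmt.Println("handles still live:", l.live.Load())
}
```

The point of the inner function is that `defer h.Free()` runs at the end of each job rather than at the end of the worker goroutine.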
Can't really help immediately then, I guess. The only thing I could think of is to fake huge XML requests and concurrency using Go 1.21, but unfortunately I don't really have the spare time right now to set this up.
Just a short note: I tested this a little, and I still suspect that the cause could be Go's memory management. You could try to play around with the time parameter of the InitWithGc function and the GOMEMLIMIT env variable. You could also check your system's ulimit settings. You could probably also try to delay worker execution and/or turn the concurrency down.
Funny thing is, my Mac with Go 1.21 fails pretty soon when testing with a concurrency of 100 and a 100 MB XML file ... my Linux machine with Go 1.17 just chuckles along, with the restriction that with a concurrency of 100 it sometimes just stalls because of CPU load. I'd personally give it a lower concurrency setting; at least on my hardware, around 20 performs rather well.
I will give this a spin a little later on my linux machine with go 1.21 and check if it makes a difference.
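For turning the concurrency down, one common pattern is a buffered channel used as a semaphore. The sketch below is generic and assumes nothing about the validation library: `runBounded` and the `work` callback are hypothetical names, and the limit of 20 simply mirrors the figure mentioned above. It also reports the peak number of jobs that were actually in flight, which helps confirm the cap is being respected.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded calls work once per job with at most limit calls in flight,
// and returns the peak concurrency it observed.
func runBounded(jobs []string, limit int, work func(string)) int64 {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64

	for _, job := range jobs {
		sem <- struct{}{} // blocks while limit jobs are already running
		wg.Add(1)
		go func(job string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the job ends

			n := inFlight.Add(1)
			for { // record a new peak if this job raised it
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			work(job)
			inFlight.Add(-1)
		}(job)
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	jobs := make([]string, 100)
	for i := range jobs {
		jobs[i] = fmt.Sprintf("file-%03d.xml", i)
	}
	peak := runBounded(jobs, 20, func(string) {
		// validation of one file would go here
	})
	fmt.Println("peak concurrency:", peak)
}
```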
Does it fail with the same stack trace?
On Mac it just fails with a killed signal without a trace; on Linux it never fails.