
Profiler Fatal Error: signal SIGSEGV: segmentation violation

Open marcosngomezi opened this issue 3 years ago • 19 comments

Hi all! We are seeing the following fatal error sporadically (every day or two):

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0xb105c6]
goroutine 82 [running]:
runtime.throw({0xe506af, 0x4})
/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc001284e80 sp=0xc001284e50 pc=0x437811
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:742 +0x2f6 fp=0xc001284ed0 sp=0xc001284e80 pc=0x44da56
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001285368, 0x736d203328202e38)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:233 +0x46 fp=0xc001284fe8 sp=0xc001284ed0 pc=0xb105c6
github.com/google/pprof/profile.(*profileMerger).mapSample(0xc001285368, 0xc002b7c180)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:164 +0x249 fp=0xc001285228 sp=0xc001284fe8 pc=0xb0f849
github.com/google/pprof/profile.Merge({0xc001285b08, 0x2, 0x4})
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:75 +0x4e7 fp=0xc001285a78 sp=0xc001285228 pc=0xb0f387
gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/pprofutils.Delta.Convert({{0xc000220800, 0x2a790, 0x0}}, 0xc00019ea00, 0xc00027a640)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/internal/pprofutils/delta.go:61 +0x146 fp=0xc001285b28 sp=0xc001285a78 pc=0xb1f186
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).deltaProfile(0xc0002b25f0, {0x0, {0xe4fb6e, 0x4}, {0xe55cb2, 0xa}, 0xc00000e6c0, 0xe96b50}, {0xc0037ae000, 0x2a790, ...})
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:265 +0x14d fp=0xc001285bb0 sp=0xc001285b28 pc=0xb236cd
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc0002b25f0, 0x0)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:221 +0x35e fp=0xc001285dc0 sp=0xc001285bb0 pc=0xb2313e
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc0002b25f0, 0xc000510660)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:191 +0x287 fp=0xc001285f78 sp=0xc001285dc0 pc=0xb25947
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1()
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:163 +0xda fp=0xc001285fe0 sp=0xc001285f78 pc=0xb255ba
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc001285fe8 sp=0xc001285fe0 pc=0x46aaa1
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:158 +0xf2

We are using [email protected] and Go 1.17 with Alpine 3.15.

marcosngomezi avatar Apr 19 '22 13:04 marcosngomezi

Thank you for the report! I'll look into this ASAP.

nsrip-dd avatar Apr 19 '22 13:04 nsrip-dd

So I haven't yet reproduced the crash, and I suspect it may be difficult on your end since it happens infrequently. However, one thing jumps out to me in the stack trace you've shared:

github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001285368, 0x736d203328202e38)

The two arguments there are the *profileMerger (the method receiver) and a *Location, both from the github.com/google/pprof/profile package. Now, the first argument looks like the address of a heap-allocated data structure, similar to the other arguments seen in the call stack. But the second argument (0x736d203328202e38) definitely doesn't look right. (Maybe coincidentally, it's a string of printable ASCII characters: `sm 3( .8`.) This makes me wonder if there's something in the process that's clobbering random memory.

Does the crash always happen in exactly the same location? Do you have any other stack traces you can share?

nsrip-dd avatar Apr 19 '22 19:04 nsrip-dd

Yes, it always happens on the same lines:

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0xad63c6]
goroutine 43 [running]:
runtime.throw({0xe00062, 0x4})
/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc001452e80 sp=0xc001452e50 pc=0x437771
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:742 +0x2f6 fp=0xc001452ed0 sp=0xc001452e80 pc=0x44d9b6
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001453368, 0x736d203528202e38)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:233 +0x46 fp=0xc001452fe8 sp=0xc001452ed0 pc=0xad63c6
github.com/google/pprof/profile.(*profileMerger).mapSample(0xc001453368, 0xc00356a580)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:164 +0x249 fp=0xc001453228 sp=0xc001452fe8 pc=0xad5649
github.com/google/pprof/profile.Merge({0xc001453b08, 0x2, 0x4})
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:75 +0x4e7 fp=0xc001453a78 sp=0xc001453228 pc=0xad5187
gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/pprofutils.Delta.Convert({{0xc000222580, 0x27d86, 0x0}}, 0xc004256000, 0xc004256280)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/internal/pprofutils/delta.go:61 +0x146 fp=0xc001453b28 sp=0xc001453a78 pc=0xae4f86
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).deltaProfile(0xc0000b6230, {0x0, {0xdff544, 0x4}, {0xe05506, 0xa}, 0xc00000e648, 0xe44b10}, {0xc002b2a000, 0x27d86, ...})
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:262 +0x137 fp=0xc001453bb0 sp=0xc001453b28 pc=0xae8db7
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc0000b6230, 0x0)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:223 +0x405 fp=0xc001453dc0 sp=0xc001453bb0 pc=0xae88c5
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc0000b6230, 0xc000122060)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:188 +0x287 fp=0xc001453f78 sp=0xc001453dc0 pc=0xaeb027
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1()
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:160 +0xda fp=0xc001453fe0 sp=0xc001453f78 pc=0xaeac9a
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc001453fe8 sp=0xc001453fe0 pc=0x46a901
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:155 +0xf2
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0xad63c6]
goroutine 56 [running]:
runtime.throw({0xe00062, 0x4})
/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc00251ee80 sp=0xc00251ee50 pc=0x437771
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:742 +0x2f6 fp=0xc00251eed0 sp=0xc00251ee80 pc=0x44d9b6
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc00251f368, 0x736d203428202e38)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:233 +0x46 fp=0xc00251efe8 sp=0xc00251eed0 pc=0xad63c6
github.com/google/pprof/profile.(*profileMerger).mapSample(0xc00251f368, 0xc002157b00)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:164 +0x249 fp=0xc00251f228 sp=0xc00251efe8 pc=0xad5649
github.com/google/pprof/profile.Merge({0xc00251fb08, 0x2, 0x4})
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:75 +0x4e7 fp=0xc00251fa78 sp=0xc00251f228 pc=0xad5187
gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/pprofutils.Delta.Convert({{0xc000222540, 0x38d66, 0x0}}, 0xc002990000, 0xc002990280)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/internal/pprofutils/delta.go:61 +0x146 fp=0xc00251fb28 sp=0xc00251fa78 pc=0xae4f86
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).deltaProfile(0xc000470050, {0x0, {0xdff544, 0x4}, {0xe05506, 0xa}, 0xc000126630, 0xe44b10}, {0xc0028fa000, 0x38d66, ...})
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:262 +0x137 fp=0xc00251fbb0 sp=0xc00251fb28 pc=0xae8db7
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc000470050, 0x0)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:223 +0x405 fp=0xc00251fdc0 sp=0xc00251fbb0 pc=0xae88c5
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc000470050, 0xc0000aa6c0)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:188 +0x287 fp=0xc00251ff78 sp=0xc00251fdc0 pc=0xaeb027
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1()
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:160 +0xda fp=0xc00251ffe0 sp=0xc00251ff78 pc=0xaeac9a
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00251ffe8 sp=0xc00251ffe0 pc=0x46a901
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:155 +0xf2
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0xb54575]
goroutine 43 [running]:
runtime.throw(0xeee29f, 0x5)
/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc001c4af78 sp=0xc001c4af48 pc=0x43a312
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:741 +0x268 fp=0xc001c4afb0 sp=0xc001c4af78 pc=0x4519a8
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001c4b400, 0x736d203828202e38, 0x0)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:233 +0x55 fp=0xc001c4b0a0 sp=0xc001c4afb0 pc=0xb54575
github.com/google/pprof/profile.(*profileMerger).mapSample(0xc001c4b400, 0xc00533e580, 0xc00118c700)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:164 +0x25c fp=0xc001c4b2b0 sp=0xc001c4b0a0 pc=0xb5341c
github.com/google/pprof/profile.Merge(0xc001c4bba0, 0x2, 0x2, 0x4, 0x0, 0x0)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:75 +0x38f fp=0xc001c4bb10 sp=0xc001c4b2b0 pc=0xb52c6f
gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/pprofutils.Delta.Convert(0xc000220500, 0x2, 0x2, 0xc000271a40, 0xc000222000, 0x0, 0x0, 0xc001c4bc28)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/internal/pprofutils/delta.go:61 +0x2d3 fp=0xc001c4bbc0 sp=0xc001c4bb10 pc=0xb665d3
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).deltaProfile(0xc0002b44b0, 0x0, 0xeed78b, 0x4, 0xef371a, 0xa, 0xc00000e618, 0xf30d88, 0xc003ef0000, 0x50188, ...)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:262 +0x2ff fp=0xc001c4bc60 sp=0xc001c4bbc0 pc=0xb6ac9f
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc0002b44b0, 0x0, 0x0, 0x0, 0x1, 0xc000fdc080, 0x0)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:223 +0x257 fp=0xc001c4bde0 sp=0xc001c4bc60 pc=0xb6a417
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc0002b44b0, 0xc0000acd20)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:188 +0x186 fp=0xc001c4bf80 sp=0xc001c4bde0 pc=0xb6cea6
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1(0xc0002b44b0)
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:160 +0xe5 fp=0xc001c4bfd8 sp=0xc001c4bf80 pc=0xb70a05
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc001c4bfe0 sp=0xc001c4bfd8 pc=0x474001
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
/go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:155 +0xe8

From what I can see, the location comes from the sample and the profile obtained in dd-trace-go. I find it hard to believe this is caused by something in another goroutine, because the other goroutines have no way of knowing the address of this parameter, and if it were something random it would be strange for the same parameter to break every time, right?

EDIT: Just realizing that the address you mentioned is the same in every crash. EDIT2: There are some differences in the stack trace lines and methods between crashes because we upgraded dependencies and Go to test whether it was something already fixed.

marcosngomezi avatar Apr 19 '22 20:04 marcosngomezi

Maybe I can try to access this address to see what it contains? It must be some kind of constant, right?

marcosngomezi avatar Apr 19 '22 20:04 marcosngomezi

Possibly, yes, though the address is slightly different in each crash (see below). If you're able to get the crash under a debugger and look at the arguments going into that function, that would probably clarify quite a bit.

The addresses (the second argument) are almost the same. Here they are, pulled out so we can see them side by side:

github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001285368, 0x736d203328202e38)
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001453368, 0x736d203528202e38)
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc00251f368, 0x736d203428202e38)
github.com/google/pprof/profile.(*profileMerger).mapLocation(0xc001c4b400, 0x736d203828202e38, 0x0)

Interestingly, the last one has a third argument.

nsrip-dd avatar Apr 19 '22 20:04 nsrip-dd

Hmm, I just took a quick look; if the address isn't the same each time, that's not going to be possible. We are not able to reproduce it, and as you said, it's infrequent.

About the third argument, I thought it was because we updated dependencies, but now I see it's always the same version of pprof, so that doesn't make much sense.

marcosngomezi avatar Apr 19 '22 20:04 marcosngomezi

Maybe we can make a version using debug.SetPanicOnFault so that the resulting panic gives us the faulty address to access and print. What do you think? Where do you think the recover should go, and what should it do so that nothing else breaks? EDIT: maybe at Delta.Convert, to also print the arguments being passed.
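
A minimal sketch of that idea, assuming a hypothetical wrapper around the delta computation (the helper name `withPanicOnFault` is made up for illustration; the fork would place something like this around Delta.Convert):

```go
// Sketch only: wrap the delta computation so a memory fault becomes a
// recoverable panic that we can log, instead of a fatal SIGSEGV.
package profilerdebug

import (
	"fmt"
	"log"
	"runtime/debug"
)

func withPanicOnFault(fn func() error) (err error) {
	// SetPanicOnFault applies to the current goroutine only and turns an
	// unexpected memory fault into a panic instead of a program crash.
	old := debug.SetPanicOnFault(true)
	defer debug.SetPanicOnFault(old)
	defer func() {
		if r := recover(); r != nil {
			// Since Go 1.13, the panic value for a fault may expose the
			// faulting address via an Addr() uintptr method.
			if fault, ok := r.(interface{ Addr() uintptr }); ok {
				log.Printf("fault at address %#x during delta computation", fault.Addr())
			}
			err = fmt.Errorf("delta computation panicked: %v", r)
		}
	}()
	return fn()
}
```

The wrapped fn would be the call into pprofutils.Delta.Convert, and the arguments being passed could be logged at the same place.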

marcosngomezi avatar Apr 19 '22 20:04 marcosngomezi

I'm going to test it with this fork: https://github.com/marcosg-peya/dd-trace-go/tree/sigsegv-test

marcosngomezi avatar Apr 19 '22 23:04 marcosngomezi

Good idea, please report back if you find anything. It might also be worth compiling the program with the address sanitizer (Go 1.18 added the -asan flag), though I'm not 100% sure if all the dependencies are available on Alpine Linux.

nsrip-dd avatar Apr 20 '22 02:04 nsrip-dd

We didn't get any of the added logs, but we did get a different error that I think is a clear clue to what's happening:

fatal error: concurrent map read and map write
goroutine 40 [running]:
runtime.throw({0xe6ab6f, 0x80})
/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc000b1ef18 sp=0xc000b1eee8 pc=0x437811
runtime.mapaccess1_faststr(0xc000b1f300, 0xc0007a98c0, {0xc001c0c380, 0x5})
/usr/local/go/src/runtime/map_faststr.go:21 +0x3a5 fp=0xc000b1ef80 sp=0xc000b1ef18 pc=0x414525
github.com/google/pprof/profile.(*profileMerger).mapSample(0xc000b1f300, 0xc0005f4a80)
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:172 +0x434 fp=0xc000b1f1c0 sp=0xc000b1ef80 pc=0xb0fa94
github.com/google/pprof/profile.Merge({0xc000b1fad8, 0x2, 0x4})
/go/pkg/mod/github.com/google/[email protected]/profile/merge.go:75 +0x4e7 fp=0xc000b1fa10 sp=0xc000b1f1c0 pc=0xb0f3e7
gopkg.in/DataDog/dd-trace-go.v1/profiler/internal/pprofutils.Delta.Convert({{0xc00021e800, 0x13f45, 0x0}}, 0xc000278280, 0xc0016c0280)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/internal/pprofutils/delta.go:86 +0x20b fp=0xc000b1fb28 sp=0xc000b1fa10 pc=0xb1f2ab
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).deltaProfile(0xc000112230, {0x0, {0xe4fc4e, 0x4}, {0xe55d92, 0xa}, 0xc00000e6c0, 0xe96d28}, {0xc00211e000, 0x13f45, ...})
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profile.go:265 +0x14d fp=0xc000b1fbb0 sp=0xc000b1fb28 pc=0xb23dcd
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc000112230, 0x0)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profile.go:221 +0x35e fp=0xc000b1fdc0 sp=0xc000b1fbb0 pc=0xb2383e
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc000112230, 0xc000690660)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profiler.go:191 +0x287 fp=0xc000b1ff78 sp=0xc000b1fdc0 pc=0xb26047
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profiler.go:163 +0xda fp=0xc000b1ffe0 sp=0xc000b1ff78 pc=0xb25cba
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000b1ffe8 sp=0xc000b1ffe0 pc=0x46aac1
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profiler.go:158 +0xf2

We're still looking to see if we find anything else.

marcosngomezi avatar Apr 22 '22 14:04 marcosngomezi

Ah, very interesting! Thank you for running more tests. This looks like a promising lead. Have you tried compiling and running your program with the -race flag to enable the race detector? I am wondering now if the original crash you've reported is also a symptom of a race condition. The race detector doesn't necessarily catch every possible data race but it's worth a shot.

EDIT: I'm running some more race testing on my end, and I now strongly suspect this is a data race on our side (in dd-trace-go). EDIT2: Also, would you be willing/able to share your profiler configuration? e.g. period, CPU duration, enabled profiles, etc.

nsrip-dd avatar Apr 22 '22 15:04 nsrip-dd

We always run our tests with -race. These are the configs reported in the logs at startup:

Datadog Tracer v1.37.1 INFO: DATADOG TRACER CONFIGURATION {"date":"2022-04-22T14:10:07Z","os_name":"Linux (Unknown Distribution)","os_version":"unknown","version":"v1.37.1","lang":"Go","lang_version":"go1.17.9","env":"live","service":"***","agent_url":"http://***/v0.4/traces","agent_error":"","debug":false,"analytics_enabled":false,"sample_rate":"0.100000","sampling_rules":null,"sampling_rules_error":"","service_mappings":null,"tags":{"runtime-id":"***"},"runtime_metrics_enabled":true,"health_metrics_enabled":true,"profiler_code_hotspots_enabled":true,"profiler_endpoints_enabled":true,"dd_version":"***","architecture":"amd64","global_service":"***","lambda_mode":"false","appsec":false,"agent_features":{"DropP0s":true,"Stats":true,"StatsdPort":0}}

EDIT: The only unusual condition I can think of is requests with very short timeouts, under 1ms, which could produce an uncommon race condition.

marcosngomezi avatar Apr 22 '22 15:04 marcosngomezi

Do you have a similar log starting with `Profiler configuration:`?

nsrip-dd avatar Apr 22 '22 16:04 nsrip-dd

My bad:

Datadog Tracer v1.37.1 INFO: Profiler configuration: {"date":"2022-04-22T14:10:07Z","os_name":"Linux (Unknown Distribution)","os_version":"unknown","version":"v1.37.1","lang":"Go","lang_version":"go1.17.9","hostname":"888","delta_profiles":true,"service":"***","env":"live","target_url":"http://***/profiling/v1/input","agentless":false,"tags":["pid:1","***","profiler_version:v1.37.1","runtime_version:1.17.9","runtime_compiler:gc","runtime_arch:amd64","runtime_os:linux","runtime-id:***","version:***"],"profile_period":"1m0s","enabled_profiles":["metrics","cpu","heap","block","goroutine"],"cpu_duration":"1m0s","block_profile_rate":100000000,"mutex_profile_fraction":10,"max_goroutines_wait":1000,"upload_timeout":"10s"}
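
For reference, that configuration corresponds roughly to a Start call like this (a sketch only; the service and env values are placeholders, and the option names assume the standard dd-trace-go v1 profiler API):

```go
package main

import (
	"log"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Sketch of the reported configuration; placeholder service/env values.
	err := profiler.Start(
		profiler.WithService("***"),
		profiler.WithEnv("live"),
		profiler.WithPeriod(time.Minute),  // profile_period: 1m0s
		profiler.CPUDuration(time.Minute), // cpu_duration: 1m0s
		profiler.WithUploadTimeout(10*time.Second),
		profiler.BlockProfileRate(100000000),
		profiler.WithProfileTypes(
			profiler.MetricsProfile,
			profiler.CPUProfile,
			profiler.HeapProfile,
			profiler.BlockProfile,
			profiler.GoroutineProfile,
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()
	// ... application code ...
}
```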

marcosngomezi avatar Apr 22 '22 17:04 marcosngomezi

Did you find it, @nsrip-dd? Is there anything to test on our side?

marcosngomezi avatar Apr 22 '22 21:04 marcosngomezi

Hi @marcosg-peya. Unfortunately I did not find it yet. I found some race issues in our unit tests that I think are different than this issue (fixed in https://github.com/DataDog/dd-trace-go/pull/1255).

Thank you again for the information you've provided so far. To summarize the state of things:

  • You've observed crashes due to bad memory access on this line of code several times, even with different Go, dd-trace-go, and google/pprof versions.
    • The crash is specifically in dereferencing the src pointer to access src.ID. The crash message claims that the bad access is at address 0x0. This is strange because src is checked to be non-nil in the lines immediately before the crash.
  • You've also observed a crash due to concurrent map access on this line of code.
  • In both cases, the data being manipulated are passed in by dd-trace-go to profile.Merge.
  • It appears that some other part of the process (possibly dd-trace-go) is modifying the same memory that dd-trace-go is trying to access while computing delta profiles.

There are a few things I would like to try next:

  • Call debug.SetTraceback("all"), or set GOTRACEBACK=all, to get a stack trace of every goroutine when the program crashes (a minimal sketch follows after this list). This would hopefully show us if there are multiple goroutines running in dd-trace-go/profiler code. Normally there should only be one goroutine actually collecting profiles.
  • If possible, build the exact same workload that's currently crashing, but with the -race flag. Running a real workload with the race detector can slow things down significantly so I understand if you would be hesitant to do this.
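
As a minimal sketch of the first item, assuming you'd rather set the traceback level in code than via the GOTRACEBACK environment variable:

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to running with GOTRACEBACK=all: on a fatal error the
	// runtime prints stack traces for all user-created goroutines, not
	// just the one that crashed.
	debug.SetTraceback("all")

	// ... start the profiler and the rest of the application here ...
}
```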

nsrip-dd avatar Apr 25 '22 13:04 nsrip-dd

Hi @nsrip-dd. Unfortunately I can't share the full log, but here are the traces of the goroutines that reference dd-trace-go in the fatal error: concurrent map read and map write

goroutine 75 [select]:
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).reportRuntimeMetrics(0xc0001340a0, 0x2540be400)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/metrics.go:43 +0x1b0
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newTracer.func1()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:224 +0x68
created by gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newTracer
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:222 +0xe7
goroutine 76 [select]:
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).worker(0xc0001340a0, 0xc0000b54a0)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:275 +0xf7
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newTracer.func2()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:236 +0xba
created by gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newTracer
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:228 +0x145
goroutine 77 [select]:
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).reportHealthMetrics(0xc0001340a0, 0x0)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/metrics.go:100 +0x1d7
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newTracer.func3()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:241 +0x65
created by gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newTracer
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/tracer.go:239 +0x19b
goroutine 78 [select]:
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*concentrator).runFlusher(0xc0000b6280, 0xc00010e660)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/stats.go:105 +0xbd
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*concentrator).Start.func1()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/stats.go:93 +0x98
created by gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*concentrator).Start
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/stats.go:89 +0xb3
goroutine 79 [select, 6 minutes]:
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*concentrator).runIngester(0xc0000b6280)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/stats.go:136 +0xb4
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*concentrator).Start.func2()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/stats.go:98 +0x5a
created by gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*concentrator).Start
/go/pkg/mod/github.com/marcosg-peya/[email protected]/ddtrace/tracer/stats.go:96 +0x107
goroutine 41 [select, 1 minutes]:
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).send(0xc000112230)
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profiler.go:257 +0xc5
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func2()
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profiler.go:168 +0x5a
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
/go/pkg/mod/github.com/marcosg-peya/[email protected]/profiler/profiler.go:166 +0x147

There are other goroutines related to the datadog-go library, but I understand those are not relevant.

marcosngomezi avatar Apr 25 '22 19:04 marcosngomezi

Thanks, that's helpful. I assume that the goroutines in that stack trace would also include the stack trace you shared in this comment?

I see the (*profiler).send and (*profiler).collect goroutines, which is what I would expect normally. The send goroutine is only responsible for uploading the profiles after they've already been encoded. The collect goroutine collects the profiles sequentially in a loop and does the manipulation, including profile.Merge. I don't see any other goroutines from the profiler package, such as a leaked goroutine that might be concurrently accessing the profiles.

Are there any other stacks with calls like runtime.mapassign or runtime.mapdelete? If not, I think the "concurrent map read and write" is not caused by the profiler library actually concurrently reading and writing the map. I think this might be caused by other code writing to memory it's not normally supposed to. For example, this could be caused by a C library if your program (or another dependency) uses cgo. Or it could come from a package using unsafe.Pointer to manipulate memory in a way that writes out of bounds or does invalid pointer manipulation. If something like this is happening, it is possible that the profiler data structures coincidentally end up in the region of memory that's getting accessed incorrectly. One way to test this would be to completely disable the profiler and see if the program still has crashes.
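
For that last check, something like the following would let you toggle the profiler off without maintaining a separate build (the environment variable name ENABLE_DD_PROFILER is made up for illustration):

```go
package main

import (
	"log"
	"os"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Hypothetical application-level switch: run with the profiler fully
	// disabled to see whether the crashes persist.
	if os.Getenv("ENABLE_DD_PROFILER") == "true" {
		if err := profiler.Start(profiler.WithService("your-service")); err != nil {
			log.Printf("profiler failed to start: %v", err)
		} else {
			defer profiler.Stop()
		}
	}
	// ... rest of the application ...
}
```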

There are also a few tools available to detect invalid memory access, if that's what is happening:

  • For Go code that uses unsafe.Pointer and does pointer arithmetic, the code can be compiled like so to add extra pointer checking: go build -gcflags='-d=checkptr'. This works on Alpine Linux. (See the contrived sketch after this list for the kind of misuse this catches.)
  • For C code, Go 1.18 adds support for using an address sanitizer by building with the -asan option. The address sanitizer instruments C code to detect many kinds of invalid memory access. I don't know if it's possible to use on Alpine Linux but it should be available on something like golang:1.18-buster.
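
To illustrate the first option, here is a contrived example (not taken from your code) of the kind of unsafe.Pointer misuse that checkptr flags at runtime:

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	buf := make([]byte, 16)
	// Converting to *int64 at an odd offset yields a misaligned pointer.
	// Built normally this will often appear to work on amd64; built with
	// go build -gcflags='-d=checkptr' the runtime throws
	// "checkptr: misaligned pointer conversion" instead.
	p := (*int64)(unsafe.Pointer(&buf[1]))
	fmt.Println(*p)
}
```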

If you try any of these things and don't get any results, please let me know if you'd like to set up a Zoom meeting so we can walk through your code, dependencies, profiles, logs, etc. together.

nsrip-dd avatar Apr 26 '22 17:04 nsrip-dd

Hi @nsrip-dd, sorry for the lack of response; I had to shift to other issues. In summary: we are not using the unsafe package in our code, and there were no other stacks calling mapassign or mapdelete. We turned off the profiler and the crashes stopped. I will turn it on again to see if they come back, and will contact you to review it if they do.

marcosngomezi avatar May 23 '22 17:05 marcosngomezi