dd-trace-php
reset tracing when pcntl_fork is called
This will do four things (see the sketch after the list):
- reset the open, closed, root, and dropped span globals
- reseed the random number generator used to assign span ids
- generate a root span if configured to do so
- turn background sending off and then on again, to remove timeouts when communicating with the agent
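A minimal usage sketch of what this enables, assuming the reset happens transparently inside pcntl_fork (do_child_work() is a placeholder, not part of dd-trace-php):
<?php
$pid = pcntl_fork();
if ($pid === 0) {
    // Child: with this PR the tracing state was reset, so spans created
    // here belong to a fresh trace (with a new root span if configured)
    // instead of being appended to the parent's open spans.
    do_child_work(); // placeholder
    exit(0);
}
// Parent: its trace continues unaffected.
pcntl_waitpid($pid, $status);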
Hey @jlesueur, thanks for the contribution!
I think this is a pretty great start!
On top of that, I think it should first assign the original top span's id, DDTRACE_G(open_spans_top)->span.span_id (if open_spans_top is not NULL), to DDTRACE_G(distributed_parent_trace_id), so that distributed tracing works across the fork.
@jlesueur Do you mind if we take over the PR and commit to your branch to finish it up? :-)
@bwoebi I do not mind at all. I did push a fix just now before I saw your comment, but am happy to hand this off from here.
Great, thanks a lot for the start, I'll take over from here then :-)
Another open question: shall we auto-instrument pcntl_fork by default, so that traces of forks always have a pcntl_fork span as parent, instead of being on the same level as other parent-process spans?
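For comparison, a hedged sketch of opting in from userland with the existing DDTrace\trace_function API (assuming it can hook this internal function in your version; treat the details as an assumption rather than a recommendation):
<?php
\DDTrace\trace_function('pcntl_fork', function (\DDTrace\SpanData $span, array $args, $retval) {
    // pcntl_fork() returns 0 in the child, the child's pid in the parent,
    // and -1 on failure.
    $span->name = 'pcntl_fork';
    $span->resource = $retval === 0 ? 'child' : 'parent';
});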
So that you have some use cases to guide you in answering these questions, we have two main uses in our code base:
- an event processor pulling events off a queue, then forking to handle each event. Each fork runs one of a couple dozen different event handlers, and we use the event handler as the resource name in the child fork. We do not care about traces from the parent that pulls events off the queue.
- various cron jobs that fork to perform work for each tenant (see the sketch after this list). We care about the parent trace as well as the child traces. The parent forks up to 30k times, running 20-30 processes simultaneously, which tends not to look very good in the flame chart if the parent trace contains the child traces. It would be nice if the distributed tracing connection from the parent fork to the child fork were configurable.
In both cases, we see each child fork as a complete process, similar to an HTTP request.
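A minimal sketch of that second use case, with hypothetical helper names (get_tenant_ids() and process_tenant() are placeholders, not part of dd-trace-php):
<?php
$tenants = get_tenant_ids(); // placeholder for loading the tenant list
$maxChildren = 25;           // 20-30 concurrent workers, as described above
$running = 0;
foreach ($tenants as $tenant) {
    if ($running >= $maxChildren) {
        pcntl_waitpid(-1, $status); // wait for any child to exit
        $running--;
    }
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child: ideally its own top-level trace, optionally linked to the
        // parent via a configurable distributed-tracing connection.
        process_tenant($tenant); // placeholder for the per-tenant work
        exit(0);
    }
    $running++;
}
while ($running-- > 0) {
    pcntl_waitpid(-1, $status); // reap the remaining children
}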
We're testing this PR, and we get periodic hangs while cleaning up curl handles. We're pretty sure it's a cleanup that doesn't complete when forking. From gdb (line numbers may not match up exactly):
#0 zend_hash_index_del (ht=0x7f5ba66dfd40, h=1621) at ./Zend/zend_hash.c:1520
#1 0x00007f5ba899e08a in dd_multi_remove_handle (ch=0x7f5babc15d60, mh=0x7f5babc15d50)
at /home/circleci/app/ext/php7/handlers_curl.c:288
#2 zif_ddtrace_curl_multi_remove_handle (execute_data=0x7f5babc15d00, return_value=0x7ffe88662280)
at /home/circleci/app/ext/php7/handlers_curl.c:443
#3 0x00007f5babead0ad in xdebug_execute_internal (current_execute_data=0x7f5babc15d00, return_value=0x7ffe88662280)
at ./build-7.4/src/base/base.c:897
#4 0x00007f5ba8988238 in zai_interceptor_execute_internal_impl (prev=true, return_value=0x7ffe88662280,
execute_data=0x7f5babc15d00) at /home/circleci/app/zend_abstract_interface/interceptor/php7/interceptor.c:413
#5 zai_interceptor_execute_internal_impl (prev=true, return_value=0x7ffe88662280, execute_data=0x7f5babc15d00)
at /home/circleci/app/zend_abstract_interface/interceptor/php7/interceptor.c:390
#6 zai_interceptor_execute_internal (execute_data=0x7f5babc15d00, return_value=0x7ffe88662280)
at /home/circleci/app/zend_abstract_interface/interceptor/php7/interceptor.c:425
#7 0x0000558b09375020 in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1620
#8 execute_ex (ex=0x7f5ba66dfd40) at ./Zend/zend_vm_execute.h:53521
#9 0x00007f5babeac7d2 in xdebug_execute_ex (execute_data=0x7f5babc15bd0) at ./build-7.4/src/base/base.c:779
#10 0x0000558b09374f6e in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1602
#11 execute_ex (ex=0x7f5ba66dfd40) at ./Zend/zend_vm_execute.h:53521
#12 0x00007f5babeac7d2 in xdebug_execute_ex (execute_data=0x7f5babc158e0) at ./build-7.4/src/base/base.c:779
#13 0x0000558b09374f6e in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1602
#14 execute_ex (ex=0x7f5ba66dfd40) at ./Zend/zend_vm_execute.h:53521
#15 0x00007f5babeac7d2 in xdebug_execute_ex (execute_data=0x7f5babc15730) at ./build-7.4/src/base/base.c:779
#16 0x0000558b09374f6e in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1602
#17 execute_ex (ex=0x7f5ba66dfd40) at ./Zend/zend_vm_execute.h:53521
#18 0x00007f5babeac7d2 in xdebug_execute_ex (execute_data=0x7f5babc15690) at ./build-7.4/src/base/base.c:779
#19 0x0000558b092df8af in zend_call_function (fci=fci@entry=0x7ffe88662780, fci_cache=fci_cache@entry=0x7ffe88662760)
at ./Zend/zend_execute_API.c:820
#20 0x0000558b0931fab2 in zend_objects_destroy_object (object=0x7f5b9f77a820) at ./Zend/zend_objects.c:179
#21 0x0000558b093248af in zend_objects_store_call_destructors (objects=objects@entry=0x558b094f0fc8 <executor_globals+840>)
at ./Zend/zend_objects_API.c:56
#22 0x0000558b092de4eb in shutdown_destructors () at ./Zend/zend_execute_API.c:247
#23 0x0000558b092ee235 in zend_call_destructors () at ./Zend/zend.c:1183
relevant code:
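/* Cleanup hooked onto curl_multi_remove_handle(): deletes the multi
 * handle's resource id from the wrapper object's "multi" hashtable. */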
if (readfunc && Z_TYPE_P(readfunc) == IS_OBJECT && Z_OBJCE_P(readfunc) == &dd_curl_wrap_handler_ce) {
    struct dd_curl_wrapper *wrapper = (struct dd_curl_wrapper *)Z_OBJ_P(readfunc);
    zend_hash_index_del(&wrapper->multi, Z_RES_HANDLE_P(mh));
}
So, is it looping indefinitely within that topmost zend_hash_index_del call? Otherwise, I'm not seeing any particular location in the stacktrace that could loop.
Yes, that's where I see it looping when I get it into gdb.
GDB shows it stuck inside this while loop: https://heap.space/xref/PHP-7.4/Zend/zend_hash.c?r=2c508c4d#1513 (copy/paste from the source doesn't render well in GitHub comments...)
The pure action of forking should not affect this at all or require any cleanup here. In fact, this should be totally unrelated to this particular PR and instead be a pre-existing bug.
I suspect there is a rare problem with destruction order, which causes the internal objects containing the curl multi mappings to be destroyed before the curl handles themselves (on PHP 7 we create internal objects to store some curl data directly on the handles, because WeakMap is not yet a thing there), leading to unexpected behaviour. I'm surprised it doesn't crash, but things happen sometimes; with a debug build it should give an assertion failure or outright crash in any case. I will try to verify the bug tomorrow.
It was indeed trivial to reproduce the issue (this is with a debug build, and it crashes as I suspected):
circleci(buster):~/app$ DD_TRACE_CLI_ENABLED=1 php ../r.php
/usr/local/src/php/Zend/zend_hash.c(1497) : ht=0x7f9642e89040 is already destroyed
php: /usr/local/src/php/Zend/zend_hash.c:71: _zend_is_inconsistent: Assertion `0' failed.
Aborted (core dumped)
circleci(buster):~/app$ cat ../r.php
<?php
$ch = curl_init();
$mh = curl_multi_init();
curl_multi_add_handle($mh, $ch);
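// The object held in the static property survives until late in request
// shutdown, so its destructor runs after the tracer's internal curl wrapper
// objects (and the hashtable they hold) have already been destroyed.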
class A {
    public static $a;
    public $ch;
    public $mh;

    function __destruct() {
        curl_multi_remove_handle($this->mh, $this->ch);
    }
}
A::$a = new A;
A::$a->ch = $ch;
A::$a->mh = $mh;
@jlesueur I've just merged a fix for this particular issue into master and rebased this branch on master. If you want to continue testing this PR, feel free; it's appreciated!
I was able to confirm that this fixes the hang I was seeing.
I've got other priorities right now, but we're definitely still planning to release this quite soon. Sorry that I haven't gotten to it in time for 0.77, but I'll try to target the next release in any case!
Thank you. We tested this locally and in staging, have been running it in production for the past week, and are seeing good results. We'll be happy to get back onto a mainline version of the extension :)
Any news? Our company is also very interested in this fix.
I'm looking at this again now; adding arm64 support had a higher priority. I definitely want to finish this up very soon!
Thank you very much, everyone!