
CI Visibility: tries to send telemetry and trace data to local agent even when agentless mode is enabled

Open ento opened this issue 1 year ago • 16 comments

Summary of problem

I'm trying out CI Visibility with a pytest test suite, and I'm seeing log messages like "failed to send, dropping 415 traces to intake at http://localhost:8126/v0.4/traces after 3 retries, 44 additional messages skipped".

Environment is GitHub Actions (hosted runner) with no datadog agent running.

Which version of dd-trace-py are you using?

Saw the error message with these versions: 1.20.14, 2.3.3, and 4577f757 (latest commit as of this writing).

Which version of pip are you using?

23.3.2

Which libraries and their versions are you using?

`pip freeze` against the virtualenv that I set up locally using the steps below
$ pip freeze
attrs==23.1.0
bytecode==0.15.1
cattrs==23.2.3
coverage==7.3.4
ddsketch==2.0.4
ddtrace @ https://github.com/DataDog/dd-trace-py/archive/4577f757cc8ab750d1fdbfae1fb767dcbacf6f22.zip#sha256=9e07cc061830bce8ef9fdd8e05acdcae6652e073fe7e07cbd4ae83fb17eceaa9
Deprecated==1.2.14
envier==0.5.0
importlib-metadata==6.11.0
iniconfig==2.0.0
opentelemetry-api==1.22.0
packaging==23.2
pluggy==1.3.0
protobuf==4.25.1
pytest==7.4.3
pytest-cov==4.1.0
six==1.16.0
typing_extensions==4.9.0
wrapt==1.16.0
xmltodict==0.13.0
zipp==3.17.0

How can we reproduce your problem?

Set up a virtualenv with ddtrace and pytest installed

python3 -m venv env
. env/bin/activate
SETUPTOOLS_SCM_PRETEND_VERSION=2.4.0dev pip install https://github.com/DataDog/dd-trace-py/archive/4577f757cc8ab750d1fdbfae1fb767dcbacf6f22.zip
pip install pytest pytest-cov
echo 'def test(): pass' > test.py

Run a mock HTTP server (to be used as CI Visibility's intake host). This isn't necessary to reproduce, but it quiets unrelated errors.

docker run --rm -p 8080:8080 -t mendhak/http-https-echo:31

What is the result that you get?

With --ddtrace enabled:

DD_CIVISIBILITY_AGENTLESS_ENABLED=true DD_TRACE_DEBUG=1 DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080  pytest test.py --ddtrace 2>&1 | grep 8126
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

With --ddtrace-patch-all and --cov (to generate spans) enabled in addition:

$ DD_CIVISIBILITY_AGENTLESS_ENABLED=true DD_TRACE_DEBUG=1 DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080  pytest test.py --ddtrace --ddtrace-patch-all --cov 2>&1 | grep 8126
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
creating new intake connection to http://localhost:8126 with timeout 2
creating new intake connection to http://localhost:8126 with timeout 2
creating new intake connection to http://localhost:8126 with timeout 2
failed to send, dropping 65 traces to intake at http://localhost:8126/v0.5/traces after 3 retries
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.
failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

Requests get sent to the mock HTTP server that DD_CIVISIBILITY_AGENTLESS_URL points to, so I believe agentless mode gets enabled as intended with this setup.

::ffff:172.17.0.1 - - [22/Dec/2023:23:09:49 +0000] "POST /api/v2/citestcycle HTTP/1.1" 200 6860 "-" "-"

What is the result that you expected?

No failure trying to send telemetry or traces.

When I view test runs in Datadog, I only see top-level spans for each test case and no spans for database access, etc., which may be because ddtrace isn't able to send traces.

ento avatar Dec 22 '23 23:12 ento

@ento , thanks for the great bug report.

The fact that telemetry doesn't work properly with our agentless mode enabled is a known issue (and usually hidden by the fact that it's a debug-level log). Fixing that is a todo item for this coming quarter.

The fact that traces don't work when agentless mode is combined with --ddtrace-patch-all, on the other hand, wasn't something I was aware of, and that definitely also needs a fix.

romainkomorndatadog avatar Jan 08 '24 14:01 romainkomorndatadog

Hi @romainkomorndatadog !

Can you share a workaround to silence those debug logs? It's getting noisy enough on our test runs that we're considering deactivating ddtrace for them, which is not what I'd like to do for the sake of observability.

wearebasti avatar Jan 24 '24 08:01 wearebasti

@wearebasti , are you asking specifically about the log entries stating the following?

failed to send telemetry to the Datadog Agent at http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

Setting DD_INSTRUMENTATION_TELEMETRY_ENABLED=false should stop those (although it won't stop the similar ones about http://localhost:8126/v0.5/traces ).
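
For example, adding it to the original repro command from this issue would look something like this (a minimal sketch; the API key value is a placeholder):

DD_CIVISIBILITY_AGENTLESS_ENABLED=true DD_TRACE_DEBUG=1 DD_API_KEY=<your key> DD_INSTRUMENTATION_TELEMETRY_ENABLED=false pytest test.py --ddtrace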

That said, those messages should only appear if you have DD_TRACE_DEBUG=1, and that would generate a lot of other unrelated debug messages that would be a lot more noisy, so it feels like we're talking about different things.

romainkomorndatadog avatar Jan 24 '24 09:01 romainkomorndatadog

@romainkomorndatadog Thanks for picking this up, I was asking about the messages in the format

failed to send, dropping 17 traces to intake at http://localhost:8126/v0.5/traces after 3 retries, 10 additional messages skipped

Our GitHub Actions do not use DD_TRACE_DEBUG, and our test call is a simple pytest --ddtrace -n auto --junitxml=test-report.xml. Explicitly setting DD_TRACE_DEBUG=false does not change this behaviour either.

Sorry if I'm hijacking the wrong issue here; I thought this had the same root cause -- do you have any ideas?

wearebasti avatar Jan 24 '24 15:01 wearebasti

@wearebasti , you're in the right issue (not hijacking it).

We do end up logging an error when we can't send traces (so it would indeed show up even with DD_TRACE_DEBUG=0).

A few questions:

  • Which versions of python, ddtrace, and pytest are you running?
  • Do the lines appear in file logs, or in CLI logs (or both)?
  • Do you have a pytest.ini config, PYTEST_ADDOPTS env var, or something else that might be changing the CLI or file log levels?
  • Are you using the agentless mode for our plugin (ie using DD_CIVISIBILITY_AGENTLESS_ENABLED=true)?

I wasn't able to reproduce the logging myself until I used --log-cli-level error as an arg to pytest (which I rarely ever use in testing), and ran with neither an agent nor agentless mode enabled.

As far as making the error-level log entries go away, depending on your situation:

  • running in agentless mode should make some of the error-level messages go away
  • running in an environment with an agent should make the rest go away (but that's a bad answer)
  • waiting for us to get a chance to fix the codepaths that don't currently support our agentless protocol (but that is also a bad answer because I've got some other work slotted ahead of this)

romainkomorndatadog avatar Jan 24 '24 17:01 romainkomorndatadog

@wearebasti , you're in the right issue (not hijacking it).

@romainkomorndatadog A relief! :)

  • Which versions of python, ddtrace, and pytest are you running?

Python: 3.12 (using actions/setup-python@v5 in GitHub Actions), ddtrace: 2.4.0, pytest: 7.4.0

  • Do the lines appear in file logs, or in CLI logs (or both)?

They appear in the GitHub Actions output, which appears to be some kind of "stdout" or "stderr" output.

  • Do you have a pytest.ini config, PYTEST_ADDOPTS env var, or something else that might be changing the CLI or file log levels?

We use pyproject.toml for settings, but only have the following

[tool.pytest.ini_options]
DJANGO_SETTINGS_MODULE = "foo.settings_test"
timeout = "300"
python_files = [
    "tests.py",
    "test_*.py",
    "*_tests.py",
]
addopts = [
    "--reuse-db",
    "--condensed-warnings",
    "--strict-markers",
    "--doctest-modules",
    "--browser", "chromium",
]

markers = [
    "slow: marks tests as slow",
    ....and a couple more
]
  • Are you using the agentless mode for our plugin (ie using DD_CIVISIBILITY_AGENTLESS_ENABLED=true)? Yes, the settings are the following:
    DD_CIVISIBILITY_AGENTLESS_ENABLED: true
    DD_CIVISIBILITY_ITR_ENABLED: true
    DD_ENV: ci
    DD_API_KEY: ***
    DD_SITE: datadoghq.eu

DISCLAIMER: We just updated to ddtrace 2.5.1 today and now explicitly set DD_CIVISIBILITY_LOG_LEVEL: "error"; so far, the "failed to send traces" messages have not shown up. But something that does seem new: in between some deprecation warnings (from pydantic) printed after the pytest run, we get the following 4 lines

{"event": "logger setup complete"}
{"event": "Datadog CI Visibility using agentless mode"}
{"event": "Intelligent Test Runner test skipping disabled by API"}
{"event": "CODEOWNERS file is not available"}

wearebasti avatar Jan 25 '24 10:01 wearebasti

@wearebasti , we don't use GitHub Actions for the jobs that run our tracer's tests with the ddtrace plugin enabled. Maybe the logger setup ends up being a bit different in that environment than in my local shell (but that's still TBD).

The first two new messages you saw (logger setup and CI Visibility agentless mode) come from #7955 (and #7963, which will be in 2.6.0, introduces some more logging on top of this). I thought (and still think) they're an improvement, because they actually show what our plugin's doing within pytest. The other two should have been around for some time. Something's a little odd if you're seeing those with the log level set to error, since they are info- or warning-level messages.

That said, the discussion about the logging changes from those PRs probably does belong in a separate issue, so that we can leave this one to the actual problem of certain traces failing to be properly sent in agentless mode. :)

romainkomorndatadog avatar Jan 25 '24 14:01 romainkomorndatadog

Any update on this? Just ran into the same issue using tox and pytest.

skuam avatar Mar 04 '24 09:03 skuam

@skuam , no updates, no. As far as I understand, unless someone specifically wants trace data in agentless mode (which we don't support at the moment), this is a cosmetic issue rather than a functional one, so I haven't given it much priority compared to other work I've got on deck.

Are you concerned with the logging, or with the lack of lower-level trace data in agentless mode?

romainkomorndatadog avatar Mar 04 '24 13:03 romainkomorndatadog

unless someone specifically wants trace data in agentless mode (which we don't support at the moment)

Huh, I'd assumed trace data was supposed to be available in agentless mode - one motivation for trying CI Visibility was as an aid to finding bottlenecks / areas for optimization in the test suite. (I know there's a GHA action for launching an agent; I haven't gotten around to switching to it.)

If trace data in agentless mode won't be supported in the foreseeable future, it'd be nice to have that stated in the test visibility setup doc - unless it's there and I just missed it: https://docs.datadoghq.com/tests/setup/python?tab=cloudciprovideragentless

this is a cosmetic issue

One actual change this prompted me to make to our test suite was to fix tests that assert certain logs were written so that they only look at logs from the logger the test is interested in. Otherwise, pytest's caplog fixture would pick up logs from ddtrace and cause flaky failures. ddtrace is not the only library that could emit logs in the background, so I think this was the right change regardless of ddtrace's behavior, but thought I'd share.
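
For reference, a minimal sketch of the kind of filtering I mean (the myapp logger name and the do_something function are made up for illustration, not taken from our real suite):

import logging

def do_something():
    logging.getLogger("myapp").warning("something happened")

def test_logs_from_my_logger_only(caplog):
    with caplog.at_level(logging.WARNING):
        do_something()
    # Only look at records emitted by the logger under test, so background
    # loggers (ddtrace or any other library) can't make the assertion flaky.
    records = [r for r in caplog.records if r.name == "myapp"]
    assert any("something happened" in r.getMessage() for r in records)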

ento avatar Mar 05 '24 19:03 ento

@ento I'll look at whether we can slot supporting trace data into our schedule somewhere. It likely won't happen within the next month, though. I'll add a note to our docs. Thanks for raising that.

As far as logging goes, I think it would make sense to get ddtrace's logs excluded by default (via some handler magic), although I think that presents a different issue for people who are using ddtrace in their code and tests (rather than just the pytest integration).

The line between "ddtrace internals" and integrations / "external" code is a bit murky, but there's probably a "pretty good" default behavior to find.
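
As a purely illustrative sketch of the "handler magic" idea (this is not something ddtrace does today), one option would be a standard logging.Filter attached to the application's handlers that drops records from the ddtrace logger namespace:

import logging

class ExcludeDdtraceRecords(logging.Filter):
    # Drop any record emitted by a logger under the "ddtrace" namespace.
    def filter(self, record):
        return not record.name.startswith("ddtrace")

handler = logging.StreamHandler()
handler.addFilter(ExcludeDdtraceRecords())
logging.getLogger().addHandler(handler)

The murky part mentioned above is exactly where such a filter should draw the line, e.g. whether records produced by patched integrations should count as "ddtrace" records or as application records.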

romainkomorndatadog avatar Mar 06 '24 09:03 romainkomorndatadog

+1 on supporting traces in agentless mode, we definitely need this feature to debug our slow tests and get a holistic view of bottlenecks! @romainkomorndatadog do you think we can prioritize this to be done sooner?

hdadkhah1 avatar Mar 17 '24 23:03 hdadkhah1

More or less repeating https://github.com/DataDog/dd-trace-py/issues/8710#issuecomment-2063335333 on this issue as well: any help validating the fix in #9016 by running tests using ddtrace installed with

pip install "ddtrace @ git+https://github.com/DataDog/dd-trace-py.git@79366300167b0a00886ee60b8db5839df23ef864"

would be appreciated.

romainkomorndatadog avatar Apr 18 '24 08:04 romainkomorndatadog

Sorry it took a while to check it out.

:white_check_mark: When I tried it against a minimal test that makes an HTTP request, a corresponding span for the request showed up in CI Visibility.

Test setup

test.py:

import requests

def test():
    requests.get("http://example.com")

Command:

DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py --ddtrace --ddtrace-patch-all --cov

I switched to using real API credentials since DD_API_KEY=a DD_CIVISIBILITY_AGENTLESS_URL=http://localhost:8080 resulted in an authentication error and CI Visibility got disabled.

[Datadog CI Visibility] WARNING ddtrace.internal.ci_visibility.recorder:recorder.py:595 Authentication error, disabling CI Visibility, please check Datadog API key

:x: When I updated ddtrace in my project to the revision and ran the test suite in GitHub Actions, progress stalled at the 99% mark and it eventually hit the timeout set for the workflow after 25 minutes.

  - Installing ddtrace (2.9.0.dev85+g7936630016 7936630)
...
============================= test session starts ==============================
...
Error: The action 'Run command' has timed out after 40 minutes.

On a separate run with DD_TRACE_DEBUG=1 enabled, I see a few logs that reference a local HTTP address interspersed between test execution logs:

2024-05-10T03:47:50.2072656Z creating new intake connection to http://localhost:8126 with timeout 2

Then at the end, there are lots of ValueErrors:

ValueError: I/O operation on closed file.

This continued for about 29 minutes before hitting the workflow timeout.

$ grep Traceback gha.log  | wc -l
5266
$ grep ValueError gha.log  | wc -l
5266
$ grep 'operation on closed' gha.log  | wc -l
5266
$ grep Traceback gha.log  | head -n1
2024-05-10T03:51:54.3355738Z Traceback (most recent call last):
$ grep Traceback gha.log  | tail -n1
2024-05-10T04:20:41.1918860Z Traceback (most recent call last):
The first few tracebacks

2024-05-10T03:51:54.3349534Z sent 904 in 0.31657s to https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry. response: 202
2024-05-10T03:51:54.3350540Z --- Logging error ---
2024-05-10T03:51:54.3355738Z Traceback (most recent call last):
2024-05-10T03:51:54.3367295Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.3368023Z     stream.write(msg + self.terminator)
2024-05-10T03:51:54.3368447Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.3368789Z Call stack:
2024-05-10T03:51:54.3369368Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.3370004Z     self._bootstrap_inner()
2024-05-10T03:51:54.3370973Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.3371621Z     self.run()
2024-05-10T03:51:54.3372584Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.3373436Z     self._target()
2024-05-10T03:51:54.3374426Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 759, in periodic
2024-05-10T03:51:54.3375550Z     self._client.send_event(telemetry_event)
2024-05-10T03:51:54.3376627Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 180, in send_event
2024-05-10T03:51:54.3377892Z     log.debug("sent %d in %.5fs to %s. response: %s", len(rb_json), sw.elapsed(), self.url, resp.status)
2024-05-10T03:51:54.3379175Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.3379843Z     self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.3380487Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.3381100Z     self.handle(record)
2024-05-10T03:51:54.3382150Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.3383271Z     super(DDLogger, self).handle(record)
2024-05-10T03:51:54.3383728Z Message: 'sent %d in %.5fs to %s. response: %s'
2024-05-10T03:51:54.3384689Z Arguments: (904, 0.3165722100000039, 'https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry', 202)
2024-05-10T03:51:54.4524441Z [PID 3558 | PPID 3501] Subscriber ASM is getting data
2024-05-10T03:51:54.4525524Z --- Logging error ---
2024-05-10T03:51:54.4526126Z Traceback (most recent call last):
2024-05-10T03:51:54.4527550Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.4528462Z     stream.write(msg + self.terminator)
2024-05-10T03:51:54.4528949Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.4529380Z Call stack:
2024-05-10T03:51:54.4530634Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.4531938Z     self._bootstrap_inner()
2024-05-10T03:51:54.4533191Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.4533943Z     self.run()
2024-05-10T03:51:54.4535372Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.4536498Z     self._target()
2024-05-10T03:51:54.4538357Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/_subscribers.py", line 45, in periodic
2024-05-10T03:51:54.4539515Z     log.debug("[PID %d | PPID %d] %s is getting data", os.getpid(), os.getppid(), self)
2024-05-10T03:51:54.4540365Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.4540995Z     self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.4541627Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.4542232Z     self.handle(record)
2024-05-10T03:51:54.4543148Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.4543995Z     super(DDLogger, self).handle(record)
2024-05-10T03:51:54.4544427Z Message: '[PID %d | PPID %d] %s is getting data'
2024-05-10T03:51:54.4545468Z Arguments: (3558, 3501, RemoteConfigSubscriber(status=<ServiceStatus.RUNNING: 'running'>, _interval=2.5))
2024-05-10T03:51:54.4546157Z [PID 3558 | PPID 3501] Subscriber ASM got data
2024-05-10T03:51:54.4546550Z --- Logging error ---
2024-05-10T03:51:54.4546830Z Traceback (most recent call last):
2024-05-10T03:51:54.4547469Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.4548097Z     stream.write(msg + self.terminator)
2024-05-10T03:51:54.4548459Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.4548794Z Call stack:
2024-05-10T03:51:54.4549507Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.4550110Z     self._bootstrap_inner()
2024-05-10T03:51:54.4550716Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.4551326Z     self.run()
2024-05-10T03:51:54.4552206Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.4553029Z     self._target()
2024-05-10T03:51:54.4554002Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/_subscribers.py", line 47, in periodic
2024-05-10T03:51:54.4555093Z     log.debug("[PID %d | PPID %d] %s got data", os.getpid(), os.getppid(), self)
2024-05-10T03:51:54.4555877Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.4556514Z     self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.4557143Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.4557739Z     self.handle(record)
2024-05-10T03:51:54.4558636Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.4559478Z     super(DDLogger, self).handle(record)
2024-05-10T03:51:54.4559879Z Message: '[PID %d | PPID %d] %s got data'
2024-05-10T03:51:54.4560793Z Arguments: (3558, 3501, RemoteConfigSubscriber(status=<ServiceStatus.RUNNING: 'running'>, _interval=2.5))
2024-05-10T03:51:54.6798280Z sent 727 in 0.34234s to https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry. response: 202
2024-05-10T03:51:54.6799658Z --- Logging error ---
2024-05-10T03:51:54.6800151Z Traceback (most recent call last):
2024-05-10T03:51:54.6801580Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:54.6802777Z     stream.write(msg + self.terminator)
2024-05-10T03:51:54.6803444Z ValueError: I/O operation on closed file.
2024-05-10T03:51:54.6804041Z Call stack:
2024-05-10T03:51:54.6805031Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:54.6806140Z     self._bootstrap_inner()
2024-05-10T03:51:54.6807269Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:54.6808422Z     self.run()
2024-05-10T03:51:54.6810057Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:54.6811555Z     self._target()
2024-05-10T03:51:54.6813471Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 759, in periodic
2024-05-10T03:51:54.6815199Z     self._client.send_event(telemetry_event)
2024-05-10T03:51:54.6817143Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/telemetry/writer.py", line 180, in send_event
2024-05-10T03:51:54.6819226Z     log.debug("sent %d in %.5fs to %s. response: %s", len(rb_json), sw.elapsed(), self.url, resp.status)
2024-05-10T03:51:54.6821258Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:54.6822429Z     self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:54.6823573Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:54.6824688Z     self.handle(record)
2024-05-10T03:51:54.6826343Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:54.6828199Z     super(DDLogger, self).handle(record)
2024-05-10T03:51:54.6828980Z Message: 'sent %d in %.5fs to %s. response: %s'
2024-05-10T03:51:54.6830454Z Arguments: (727, 0.342344212999933, 'https://instrumentation-telemetry-intake.datadoghq.com/api/v2/apmtelemetry', 202)
2024-05-10T03:51:55.6792971Z Agent is down or Remote Config is not enabled in the Agent
2024-05-10T03:51:55.6794098Z Check your Agent version, you need an Agent running on 7.39.1 version or above.
2024-05-10T03:51:55.6795225Z Check Your Remote Config environment variables on your Agent:
2024-05-10T03:51:55.6796021Z DD_REMOTE_CONFIGURATION_ENABLED=true
2024-05-10T03:51:55.6796903Z See: https://docs.datadoghq.com/agent/guide/how_remote_config_works/
2024-05-10T03:51:55.6797997Z --- Logging error ---
2024-05-10T03:51:55.6798525Z Traceback (most recent call last):
2024-05-10T03:51:55.6799722Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1103, in emit
2024-05-10T03:51:55.6800923Z     stream.write(msg + self.terminator)
2024-05-10T03:51:55.6801338Z ValueError: I/O operation on closed file.
2024-05-10T03:51:55.6801681Z Call stack:
2024-05-10T03:51:55.6802251Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 973, in _bootstrap
2024-05-10T03:51:55.6802887Z     self._bootstrap_inner()
2024-05-10T03:51:55.6803552Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-05-10T03:51:55.6804187Z     self.run()
2024-05-10T03:51:55.6805146Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/periodic.py", line 56, in run
2024-05-10T03:51:55.6806087Z     self._target()
2024-05-10T03:51:55.6807072Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/worker.py", line 74, in periodic
2024-05-10T03:51:55.6807988Z     return self._state()
2024-05-10T03:51:55.6809000Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/remoteconfig/worker.py", line 53, in _agent_check
2024-05-10T03:51:55.6809919Z     log.debug(
2024-05-10T03:51:55.6810471Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1465, in debug
2024-05-10T03:51:55.6811102Z     self._log(DEBUG, msg, args, **kwargs)
2024-05-10T03:51:55.6811757Z   File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/logging/__init__.py", line 1624, in _log
2024-05-10T03:51:55.6812377Z     self.handle(record)
2024-05-10T03:51:55.6813274Z   File "/home/runner/.cache/pypoetry/virtualenvs/strive-cvRqnyNv-py3.10/lib/python3.10/site-packages/ddtrace/internal/logger.py", line 148, in handle
2024-05-10T03:51:55.6814129Z     super(DDLogger, self).handle(record)
2024-05-10T03:51:55.6815843Z Message: 'Agent is down or Remote Config is not enabled in the Agent\nCheck your Agent version, you need an Agent running on 7.39.1 version or above.\nCheck Your Remote Config environment variables on your Agent:\nDD_REMOTE_CONFIGURATION_ENABLED=true\nSee: https://docs.datadoghq.com/agent/guide/how_remote_config_works/'
2024-05-10T03:51:55.6817489Z Arguments: ()

Not sure if this is a separate issue. (Those ValueErrors could be a side-effect of enabling DD_TRACE_DEBUG and not the direct cause of the stall.)

ento avatar May 10 '24 05:05 ento

@ento , thanks for the info. Just to clarify, your simple single-file test.py test does not hang, only your larger project, correct?

I'm going to try and set up a GitHub Action to test this because I can't repro locally (and it doesn't repro in our CircleCI project for the various dd-trace-py test suites that use our plugin).

If you have time to test it, DD_REMOTE_CONFIGURATION_ENABLED=false should disable the remote config part that tries to talk to our agent (I guess that's something I should also disable or make work without an agent), but I don't see why that should cause a hang (eg: in my local testing without an agent, I see the messages that the agent is down or RC is not enabled in the agent, but I don't see the hang behavior).

I'd ask you to try comparing with not using --ddtrace-patch-all, but I'm not sure how useful that would be in actually narrowing down the issue. Maybe a more interesting test would be to compare against ddtrace~=2.8.0 with --ddtrace-patch-all.

The ValueError looks similar to another issue I've got open internally which I noticed on some of our GitlabCI jobs. It looks like something is closing stderr/stdout and the log handler we set up when pytest starts ends up trying to write to a closed file. I do think that's a side effect rather than a cause of any hangs, though. Since you're using DD_TRACE_DEBUG=1, our logger takeover function should be short-circuiting and not taking over the handler, so that means that particular issue isn't limited to the CI Visibility log handler (which is what I originally thought).
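
For illustration only, the "--- Logging error ---" / ValueError pattern in your log can be reproduced with plain stdlib logging by pointing a handler at a stream that gets closed underneath it (this sketch uses an io.StringIO stand-in rather than stdout/stderr):

import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
log = logging.getLogger("demo")
log.addHandler(handler)

stream.close()
# emit() hits ValueError("I/O operation on closed file."), which logging
# reports via handleError() as a "--- Logging error ---" block on stderr.
log.warning("this record has nowhere to go")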

Side note for curiosity: in debug mode, the creating new intake connection to http://localhost:8126 messages come from the debug mode collector (which spins up before our pytest plugin has a chance to reconfigure the writer). That tells me maybe we do need to just move the agentless behavior into the default tracer.

romainkomorndatadog avatar May 10 '24 13:05 romainkomorndatadog

Just to clarify, your simple single-file test.py test does not hang, only your larger project, correct?

That was the case when I wrote the comment, but I've managed to make (a variation of) the single-file test case hang.

First, install a specific version of freezegun:

pip install -U freezegun==1.2.2

Test file (a distilled version of the test suite that I ran in GitHub Actions)

import freezegun
from ddtrace import tracer

@tracer.wrap()
def hello():
    return "hello"

@freezegun.freeze_time("2024-01-01")
def test():
    assert hello() == "hello"

Hangs:

pytest test.py -p ddtrace
pytest test.py -p no:ddtrace

Doesn't hang:

DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py -p ddtrace --ddtrace
DD_REMOTE_CONFIGURATION_ENABLED=false DD_CIVISIBILITY_AGENTLESS_ENABLED=true pytest test.py -p ddtrace

Adding the --ddtrace flag, which I forgot to add when I enabled CI Visibility, and/or disabling remote configuration fixed my GitHub Actions workflow as well.

Upgrading freezegun also makes the issue go away:

pip install -U freezegun==1.3.0

The place it was hanging was: https://github.com/DataDog/dd-trace-py/blob/79366300167b0a00886ee60b8db5839df23ef864/ddtrace/internal/remoteconfig/client.py#L217 which calls: https://github.com/DataDog/dd-trace-py/blob/79366300167b0a00886ee60b8db5839df23ef864/ddtrace/settings/config.py#L586

Because of the time freeze, time.monotonic() wasn't increasing and the internal queue was waiting forever here: https://github.com/python/cpython/blob/v3.11.6/Lib/multiprocessing/connection.py#L952
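
A small sketch of the mechanism (assuming, per the above, that freezegun 1.2.2 freezes time.monotonic() as well): any deadline computed from the frozen clock is never reached, so a polling loop like the one in multiprocessing.connection.wait() never makes progress.

import time
import freezegun

with freezegun.freeze_time("2024-01-01"):
    deadline = time.monotonic() + 2.5
    # Under the freeze, time.monotonic() returns the same value on every
    # call, so a loop that waits for `time.monotonic() >= deadline` would
    # never exit. (The loop itself is left out so this sketch doesn't hang.)
    print(time.monotonic() < deadline)  # stays True forever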

ento avatar May 16 '24 19:05 ento