opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

otelcol.Collector.DryRun() now instantiates all components

Open SeanPMiller opened this issue 1 year ago • 7 comments

Describe the bug This behavior might be intended. However, if you write your own entrypoint into the collector, as I have, and you call otelcol.Collector.DryRun() with a goal of determining whether the configuration "looks valid" before calling Run(), then service.New() is now called as part of the dry run, and its return value is discarded. This behavior is a consequence of this commit.

Due to the above behavior, calling DryRun() now, for example, calls open() and bind() for the self-monitoring telemetry-service port and the health-check extension port, among other ports. When Run() is subsequently called, the second instance of each server component fails with a bind() error because its port has already been opened and bound.

This might be a misuse of the DryRun() function. Let me know, and thanks in advance.

Steps to reproduce Write the above code.

What did you expect to see? The old behavior.

What did you see instead?

Error: failed to start extensions: failed to bind to address 0.0.0.0:13133: listen tcp 0.0.0.0:13133: bind: address already in use; failed to shutdown pipelines: no existing monitoring routine is running
2024/04/24 22:03:22 collector server run finished with error: failed to start extensions: failed to bind to address 0.0.0.0:13133: listen tcp 0.0.0.0:13133: bind: address already in use; failed to shutdown pipelines: no existing monitoring routine is running

What version did you use? v0.98.0

Environment go version go1.22.0 linux/amd64

SeanPMiller avatar Apr 24 '24 22:04 SeanPMiller

As a workaround, I have switched to using the following function:

func temporaryDryRunPatch(settings otelcol.CollectorSettings, ctxt context.Context) (err error) {
    factories, err := settings.Factories()
    if err != nil {
        return fmt.Errorf("failed to initialize factories: %w", err)
    }   

    confProv, err := otelcol.NewConfigProvider(settings.ConfigProviderSettings)
    if err != nil {
        return fmt.Errorf("failed to instantiate configuration provider: %w", err)
    }   
    defer func() {
        if shutdownErr := confProv.Shutdown(ctxt); shutdownErr != nil {
            if err != nil {
                err = fmt.Errorf("configuration-provider shutdown failed after encountering error: shutdown_err=%w, original_err=%w", shutdownErr, err)
            } else {
                err = fmt.Errorf("configuration-provider shutdown failed: %w", shutdownErr)
            }   
        }   
    }() 

    cfg, err := confProv.Get(ctxt, factories)
    if err != nil {
        return fmt.Errorf("failed to get config: %w", err)
    }   

    return cfg.Validate()
}

This gives me the behavior I want, so this is not an especially high-priority issue.

SeanPMiller avatar Apr 24 '24 22:04 SeanPMiller

Thanks for reporting! Seems like this was caused by #9257 cc @djaglowski @ycombinator

I don't quite understand why it happens though, my expectation would be that until we call Service.Start no extension actually registers anything. @SeanPMiller, two questions for you:

  1. Does this only happen with the healthcheck extension? Or can you see this happening with other components?
  2. Can you confirm that this does not happen in v0.97.0?

mx-psi avatar Apr 25 '24 11:04 mx-psi

I'll work on a demonstration program. Having a day job is a drag.

SeanPMiller avatar Apr 25 '24 16:04 SeanPMiller

https://github.com/SeanPMiller/otelcol-issue-10031

This is essentially condensed from otelcorecol, except for the custom root command.

SeanPMiller avatar Apr 25 '24 17:04 SeanPMiller

Specific answers:

  1. I see this in production with both the selfmon Prometheus telemetry service and also with healthcheck extension. The log shows, for example, the following:
    2024-04-25T18:19:24.987Z	info	[email protected]/telemetry.go:55	Setting up own telemetry...
    2024-04-25T18:19:24.987Z	info	[email protected]/telemetry.go:97	Serving metrics	{"address": ":8888", "level": "Basic"}
    2024-04-25T18:19:24.990Z	info	[email protected]/telemetry.go:55	Setting up own telemetry...
    2024-04-25T18:19:24.990Z	info	[email protected]/telemetry.go:97	Serving metrics	{"address": ":8888", "level": "Basic"}
    
  2. I can confirm that this does not happen in v0.97.0, as demonstrated by the reproduction above.

SeanPMiller avatar Apr 25 '24 18:04 SeanPMiller

Thanks for the repro! I see how this can fail with the Prometheus endpoint, I am still a bit puzzled with the healthcheck case.

mx-psi avatar Apr 26 '24 09:04 mx-psi

I am, too. I'll see if I can reproduce, but it only seems to happen with a sufficiently complex configuration.

SeanPMiller avatar Apr 26 '24 17:04 SeanPMiller

I think this issue in contrib around validate is related. The kubeletstatreceiver's createMetricsReceiver function is being invoked when validate is used.

TylerHelmuth avatar Apr 29 '24 17:04 TylerHelmuth

Perhaps we should revert https://github.com/open-telemetry/opentelemetry-collector/pull/9257 and reopen https://github.com/open-telemetry/opentelemetry-collector/issues/8007 for now, so we can take another stab at implementing validation in a different way?

ycombinator avatar May 02 '24 22:05 ycombinator

The kubeletstatreceiver's createMetricsReceiver function is being invoked when validate is used.

That's to be expected, isn't it? The problem there IMO is that a component should not be opening any files during creation, only during startup.

Perhaps we should revert https://github.com/open-telemetry/opentelemetry-collector/pull/9257 and reopen https://github.com/open-telemetry/opentelemetry-collector/issues/8007 for now, so we can take another stab at implementing validation in a different way?

Let's do this. It also seems like we have not set clear expectations on what components should do at creation time vs startup, so we should work on this as well.

mx-psi avatar May 03 '24 11:05 mx-psi

That's to be expected, isn't it? The problem there IMO is that a component should not be opening any files during creation, only during startup.

That may be true, and it is possible the k8s components have a bug, but the reason the user in that issue starting seeing an issue when using validate is because in https://github.com/open-telemetry/opentelemetry-collector/pull/9257 instead of only calling cfg.Validate we also now instantiate the components via validatePipelineCfg.

TylerHelmuth avatar May 03 '24 16:05 TylerHelmuth

I agree with reverting the change for now

TylerHelmuth avatar May 03 '24 16:05 TylerHelmuth

I'm unable to simply click the "Revert" button on https://github.com/open-telemetry/opentelemetry-collector/pull/9257 so instead I've created https://github.com/open-telemetry/opentelemetry-collector/pull/10078 manually reverting the relevant bits of that PR.

ycombinator avatar May 03 '24 16:05 ycombinator