opentelemetry-collector
opentelemetry-collector copied to clipboard
Make DataType a separate type within Component
Addresses https://github.com/open-telemetry/opentelemetry-collector/issues/9429
There were several approaches to implementing this, but I decided to make Type an interface that continues to be embedded in ID.
I split Type into ComponentType and DataType - ComponentType represents the names of components and DataType is an enum for Traces, Logs, and Metrics.
This keeps ID with the same basic API, which eases this transition. The largest change is NewType will automatically create a DataType instead of ComponentType for the strings "traces", "logs", and "metrics". In addition, ID's zero value now contains a null value for Type - since it is now an interface.
Codecov Report
Attention: Patch coverage is 80.76923% with 5 lines in your changes are missing coverage. Please review.
Project coverage is 91.63%. Comparing base (
671dcbc) to head (c6b2623).
| Files | Patch % | Lines |
|---|---|---|
| component/config.go | 68.75% | 5 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #10069 +/- ##
==========================================
- Coverage 91.65% 91.63% -0.02%
==========================================
Files 356 356
Lines 16843 16855 +12
==========================================
+ Hits 15437 15445 +8
- Misses 1063 1067 +4
Partials 343 343
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This PR seems to add complexity but I'm not clear what the benefit is. The original issue suggested moving DataType (logs, metrics, traces) out of the Component package, which I agree with. IMO, if we're reworking this, the most intuitive model is to have signal.Type (logs, metrics, traces), and component.Type (otlp, filelog, etc), and component.ID which is a pair of signal.Type and component.Type.
component.IDwhich is a pair ofsignal.Typeandcomponent.Type.
A component.ID currently is a pair of component.Type and a (possibly empty) name for the component (e.g. otlp/name). We use the fact that component.DataType is also a component.Type to also use it for pipeline names (e.g. traces/sampled). Making it a pair of signal.Type and component.Type wouldn't work
A
component.IDcurrently is a pair ofcomponent.Typeand a (possibly empty) name for the component (e.g.otlp/name). We use the fact thatcomponent.DataTypeis also acomponent.Typeto also use it for pipeline names (e.g.traces/sampled). Making it a pair ofsignal.Typeandcomponent.Typewouldn't work
Ah, you're right, crossed wires. My suggested alternative was hastily written.
I am primarily concerned about the proposal to have both component.Type and component.ComponentType. This introduces a problem which is in my opinion worse than the one we're trying to fix.
If we're going to rework all of this, it would make sense to me to move data type into a different package, since things other than components can be types.
I totally agree! I have a followup PR planned where I will move it out. Didn't make sense for me to do it all in one PR. I also plan to rename it to SignalType to better match the rest of OpenTelemetry.
I am primarily concerned about the proposal to have both component.Type and component.ComponentType. This introduces a problem which is in my opinion worse than the one we're trying to fix.
I can totally see the concern here - Type was already pretty confusing to understand. However I do think that the idea of having an interface like Type that we can fulfill with either ComponentType or DataType is a good one. Very open to other names... I'll brainstorm but honestly I don't have any good ones right now.
I think something I should have done in the beginning is lay out the problem and how I arrived at this solution:
If we want to make DataType its own type - we have to adjust Type somehow. As Pablo said:
A component.ID currently is a pair of component.Type and a (possibly empty) name for the component (e.g. otlp/name). We use the fact that component.DataType is also a component.Type to also use it for pipeline names (e.g. traces/sampled).
We have to somehow change either ID or Type to make this work.
I considered a few options:
-
Creating a
DataTypeIDjust for Pipelines - containing aDataTypeandstring name. Other components would just use the same ID as they do today. I decided against this because its a little ugly to have both a DataTypeID and ID. I'd have to change a lot of pipeline and connector code to make this work - and there might have been API changes as well. -
I could make
DataTypeandTypeshare an interface such thatIDwas a pair ofThisCommonInterface(this a placeholder name) and astring name. I decided against this because it would be an invasive change to theIDAPI - specifically changing theID.Typemethod to returnThisCommonInterfacewould affect a huge amount of code. -
Making
Typean interface. This is relatively easy to do, and not much code outside of component/ has to changed. The API stays the same and its logical to haveComponentTypeandDataTypeshare an interface.
I hope this explains how I decided on approach 3. Is this approach something that everyone agrees with? If we can get to consensus on the overall approach I can finish iterating on this PR and get it to a mergeable state.
From the Collector stability discussion, we decided to first focus in if the approach of having an interface and two types that implement said interface (the type for component types and the type for signal types) makes sense, and decide on what names to use for said types/interfaces later.
This feels like a pretty confusing topic, so I'm gonna try to define a bunch of terms for use in this comment.
Currently we have the following important terms:
Kind: represents the kind of component, such asreceiverorexporter.Type: The unique identifier for a component. This value MUST be unique betweenKindof components. Ex:otlphttp,transform.Component ID: this is the full name of the component that user defines in their configuration. It is made up of theTypeof components and optional string we callname.type[/name]. This value MUST be unique betweenKindof components. Ex:otlphttp/honeycomb,transform/k8s-events.DataType: Is a specialTypethat represents a OTel Signal, currentlytraces,metrics, orlogs. An important consequence ofDataTypebeing aTypeis that it can be used in aComponent ID. We currently use this to make named pipelines, such astraces/honeycomb.
With those terms defined, I want to poke at why we use Component ID to represent both a configured component, such as otlphttp/honeycomb, and a pipeline, such as traces/honeycomb. While they are technically both identifiers in the config, it feels to me like they server very different purposes. otlphttp/honeyomb identifies a specific component within the config and is used as the identifier for its placement in the collector's pipeline(s). traces/honeycomb is a specific pipeline and expects that Component IDs will be added to the different Kind lists (receivers, processors, exporters).
Importantly I think these two identifiers need their own naming restrictions. A Type is restricted to the regex ^[a-zA-Z][0-9a-zA-Z_]{0,62}$. That is a great regex when Type is being used to identify a component, but not when identifying a pipeline. For a pipeline DataType needs to be restricted to traces, metrics, and logs exactly. We do this with some special DataType vars, but there isn't anything in our APIs that enforces that the Component IDs used in the pipelines.Config struct be one of these vars.
I feel we'd really benefit from stopping the use of Component ID to represent both a component and a pipeline. I think Component ID and Type should refer to only components and that we should introduce a new concept of Signal and Pipeline ID. Similar to Component ID, Pipeline ID would be made up of a Signal and an optional string name. signal[/name]. Signal would be force to be traces, metrics or logs.
With this solution we have no more overlap between names and no mixing of concepts in the structs. I will admit that I only took a brief look through service/internal/graph and I'm guess it causes some havoc in there, but I think the separation of concerns will be worth it.
@TylerHelmuth's analysis makes sense to me. There is commonality in the structure of type[/name] but it's simple enough that sharing any kind of dependency between the two is really just unnecessary complication.
I will admit that I only took a brief look through service/internal/graph and I'm guess it causes some havoc in there, but I think the separation of concerns will be worth it.
I could be forgetting some details but I think it may be less complicated than the changes in this PR.
I also think that the implementation of a Pipeline ID and Signal would not (should not?) live in component. If component exists as a definition of what a Component is in the collector, I do not think we need to "leak" pipeline concepts into the package. We should probably define Signal in a config module and Pipeline ID in service somewhere.
Hey @TylerHelmuth, thanks for the comprehensive feedback! This was definitely an option I considered - see option 1 that I mentioned in my previous comment.
I can implement this other option - there will be a bunch of changes in pipeline and connector code as well as perhaps some API changes - but this is probably the right time to make those changes! You also make a good point of the pipeline API silently accepting Types that aren't DataType - that's definitely something we can improve.
I think it'll be better if I close this PR and open a different PR for this change.
I also think that the implementation of a Pipeline ID and Signal would not (should not?) live in component. If component exists as a definition of what a Component is in the collector, I do not think we need to "leak" pipeline concepts into the package. We should probably define Signal in a config module and Pipeline ID in service somewhere.
I agree with this! My original plan was to
- change the backing type for DataType
- Move DataType to another package
- Rename DataType to Signal or something like that
I will say I'm not sure Signal should live in a config module - I think that there should be a generally available enum for the different Signal's supported by the collector - and as far as I can tell DataType is the only option.
Opened https://github.com/open-telemetry/opentelemetry-collector/pull/10222 as an alternative to this PR, taking into account the feedback from this thread. Please take a look!