opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

Make DataType a separate type within Component

Open ankitpatel96 opened this issue 1 year ago • 10 comments

Addresses https://github.com/open-telemetry/opentelemetry-collector/issues/9429

There were several approaches to implementing this, but I decided to make Type an interface that continues to be embedded in ID.

I split Type into ComponentType and DataType - ComponentType represents the names of components and DataType is an enum for Traces, Logs, and Metrics.

This keeps ID with the same basic API, which eases this transition. The largest change is NewType will automatically create a DataType instead of ComponentType for the strings "traces", "logs", and "metrics". In addition, ID's zero value now contains a null value for Type - since it is now an interface.

ankitpatel96 avatar May 01 '24 23:05 ankitpatel96

Codecov Report

Attention: Patch coverage is 80.76923% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 91.63%. Comparing base (671dcbc) to head (c6b2623).

Files Patch % Lines
component/config.go 68.75% 5 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10069      +/-   ##
==========================================
- Coverage   91.65%   91.63%   -0.02%     
==========================================
  Files         356      356              
  Lines       16843    16855      +12     
==========================================
+ Hits        15437    15445       +8     
- Misses       1063     1067       +4     
  Partials      343      343              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 01 '24 23:05 codecov[bot]

This PR seems to add complexity but I'm not clear what the benefit is. The original issue suggested moving DataType (logs, metrics, traces) out of the Component package, which I agree with. IMO, if we're reworking this, the most intuitive model is to have signal.Type (logs, metrics, traces), and component.Type (otlp, filelog, etc), and component.ID which is a pair of signal.Type and component.Type.

djaglowski avatar May 09 '24 17:05 djaglowski

component.ID which is a pair of signal.Type and component.Type.

A component.ID currently is a pair of component.Type and a (possibly empty) name for the component (e.g. otlp/name). We use the fact that component.DataType is also a component.Type to also use it for pipeline names (e.g. traces/sampled). Making it a pair of signal.Type and component.Type wouldn't work

mx-psi avatar May 09 '24 20:05 mx-psi

A component.ID currently is a pair of component.Type and a (possibly empty) name for the component (e.g. otlp/name). We use the fact that component.DataType is also a component.Type to also use it for pipeline names (e.g. traces/sampled). Making it a pair of signal.Type and component.Type wouldn't work

Ah, you're right, crossed wires. My suggested alternative was hastily written.

I am primarily concerned about the proposal to have both component.Type and component.ComponentType. This introduces a problem which is in my opinion worse than the one we're trying to fix.

djaglowski avatar May 09 '24 21:05 djaglowski

If we're going to rework all of this, it would make sense to me to move data type into a different package, since things other than components can be types.

I totally agree! I have a followup PR planned where I will move it out. Didn't make sense for me to do it all in one PR. I also plan to rename it to SignalType to better match the rest of OpenTelemetry.

I am primarily concerned about the proposal to have both component.Type and component.ComponentType. This introduces a problem which is in my opinion worse than the one we're trying to fix.

I can totally see the concern here - Type was already pretty confusing to understand. However I do think that the idea of having an interface like Type that we can fulfill with either ComponentType or DataType is a good one. Very open to other names... I'll brainstorm but honestly I don't have any good ones right now.

ankitpatel96 avatar May 09 '24 23:05 ankitpatel96

I think something I should have done in the beginning is lay out the problem and how I arrived at this solution:

If we want to make DataType its own type - we have to adjust Type somehow. As Pablo said:

A component.ID currently is a pair of component.Type and a (possibly empty) name for the component (e.g. otlp/name). We use the fact that component.DataType is also a component.Type to also use it for pipeline names (e.g. traces/sampled).

We have to somehow change either ID or Type to make this work.

I considered a few options:

  1. Creating a DataTypeID just for Pipelines - containing a DataType and string name. Other components would just use the same ID as they do today. I decided against this because its a little ugly to have both a DataTypeID and ID. I'd have to change a lot of pipeline and connector code to make this work - and there might have been API changes as well.

  2. I could make DataType and Type share an interface such that ID was a pair of ThisCommonInterface (this a placeholder name) and a string name. I decided against this because it would be an invasive change to the ID API - specifically changing the ID.Type method to return ThisCommonInterface would affect a huge amount of code.

  3. Making Type an interface. This is relatively easy to do, and not much code outside of component/ has to changed. The API stays the same and its logical to have ComponentType and DataType share an interface.

I hope this explains how I decided on approach 3. Is this approach something that everyone agrees with? If we can get to consensus on the overall approach I can finish iterating on this PR and get it to a mergeable state.

ankitpatel96 avatar May 10 '24 17:05 ankitpatel96

From the Collector stability discussion, we decided to first focus in if the approach of having an interface and two types that implement said interface (the type for component types and the type for signal types) makes sense, and decide on what names to use for said types/interfaces later.

mx-psi avatar May 15 '24 15:05 mx-psi

This feels like a pretty confusing topic, so I'm gonna try to define a bunch of terms for use in this comment.

Currently we have the following important terms:

  • Kind: represents the kind of component, such as receiver or exporter.
  • Type: The unique identifier for a component. This value MUST be unique between Kind of components. Ex: otlphttp, transform.
  • Component ID: this is the full name of the component that user defines in their configuration. It is made up of the Type of components and optional string we call name. type[/name]. This value MUST be unique betweenKind of components. Ex: otlphttp/honeycomb, transform/k8s-events.
  • DataType: Is a special Type that represents a OTel Signal, currently traces, metrics, or logs. An important consequence of DataType being a Type is that it can be used in a Component ID. We currently use this to make named pipelines, such as traces/honeycomb.

With those terms defined, I want to poke at why we use Component ID to represent both a configured component, such as otlphttp/honeycomb, and a pipeline, such as traces/honeycomb. While they are technically both identifiers in the config, it feels to me like they server very different purposes. otlphttp/honeyomb identifies a specific component within the config and is used as the identifier for its placement in the collector's pipeline(s). traces/honeycomb is a specific pipeline and expects that Component IDs will be added to the different Kind lists (receivers, processors, exporters).

Importantly I think these two identifiers need their own naming restrictions. A Type is restricted to the regex ^[a-zA-Z][0-9a-zA-Z_]{0,62}$. That is a great regex when Type is being used to identify a component, but not when identifying a pipeline. For a pipeline DataType needs to be restricted to traces, metrics, and logs exactly. We do this with some special DataType vars, but there isn't anything in our APIs that enforces that the Component IDs used in the pipelines.Config struct be one of these vars.

I feel we'd really benefit from stopping the use of Component ID to represent both a component and a pipeline. I think Component ID and Type should refer to only components and that we should introduce a new concept of Signal and Pipeline ID. Similar to Component ID, Pipeline ID would be made up of a Signal and an optional string name. signal[/name]. Signal would be force to be traces, metrics or logs.

With this solution we have no more overlap between names and no mixing of concepts in the structs. I will admit that I only took a brief look through service/internal/graph and I'm guess it causes some havoc in there, but I think the separation of concerns will be worth it.

TylerHelmuth avatar May 15 '24 21:05 TylerHelmuth

@TylerHelmuth's analysis makes sense to me. There is commonality in the structure of type[/name] but it's simple enough that sharing any kind of dependency between the two is really just unnecessary complication.

I will admit that I only took a brief look through service/internal/graph and I'm guess it causes some havoc in there, but I think the separation of concerns will be worth it.

I could be forgetting some details but I think it may be less complicated than the changes in this PR.

djaglowski avatar May 16 '24 02:05 djaglowski

I also think that the implementation of a Pipeline ID and Signal would not (should not?) live in component. If component exists as a definition of what a Component is in the collector, I do not think we need to "leak" pipeline concepts into the package. We should probably define Signal in a config module and Pipeline ID in service somewhere.

TylerHelmuth avatar May 16 '24 03:05 TylerHelmuth

Hey @TylerHelmuth, thanks for the comprehensive feedback! This was definitely an option I considered - see option 1 that I mentioned in my previous comment.

I can implement this other option - there will be a bunch of changes in pipeline and connector code as well as perhaps some API changes - but this is probably the right time to make those changes! You also make a good point of the pipeline API silently accepting Types that aren't DataType - that's definitely something we can improve.

I think it'll be better if I close this PR and open a different PR for this change.

ankitpatel96 avatar May 17 '24 20:05 ankitpatel96

I also think that the implementation of a Pipeline ID and Signal would not (should not?) live in component. If component exists as a definition of what a Component is in the collector, I do not think we need to "leak" pipeline concepts into the package. We should probably define Signal in a config module and Pipeline ID in service somewhere.

I agree with this! My original plan was to

  1. change the backing type for DataType
  2. Move DataType to another package
  3. Rename DataType to Signal or something like that

I will say I'm not sure Signal should live in a config module - I think that there should be a generally available enum for the different Signal's supported by the collector - and as far as I can tell DataType is the only option.

ankitpatel96 avatar May 17 '24 21:05 ankitpatel96

Opened https://github.com/open-telemetry/opentelemetry-collector/pull/10222 as an alternative to this PR, taking into account the feedback from this thread. Please take a look!

ankitpatel96 avatar Jun 05 '24 14:06 ankitpatel96