opentelemetry-collector-contrib icon indicating copy to clipboard operation
opentelemetry-collector-contrib copied to clipboard

[pkg/ottl] Parse uri string to url.* SemConv attributes

Open michalpristas opened this issue 1 year ago • 4 comments
trafficstars

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

parsing URI possible only using complex RegEx

Describe the solution you'd like

Intended converter parses a Uniform Resource Identifier (URI) string and extracts its components into URL SemConv attributes.

Example

input:

"sample": "http://myusername:[email protected]:80/foo.gif?key1=val1&key2=val2#fragment"

Result:

{
  "url": {
    "path" : "/foo.gif",
    "fragment" : "fragment",
    "extension" : "gif",
    "password" : "mypassword",
    "original" : "http://myusername:[email protected]:80/foo.gif?key1=val1&key2=val2#fragment",
    "scheme" : "http",
    "port" : 80,
    "user_info" : "myusername:mypassword",
    "domain" : "www.example.com",
    "query" : "key1=val1&key2=val2",
    "username" : "myusername"
  }
}

Describe alternatives you've considered

No response

Additional context

No response

michalpristas avatar Apr 16 '24 11:04 michalpristas

Pinging code owners:

  • pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Apr 16 '24 11:04 github-actions[bot]

@michalpristas I suspect this is doable via regular expression, and any converter would be doing regex internally. Can this be accomplished with ExtractPatterns?

TylerHelmuth avatar Apr 17 '24 03:04 TylerHelmuth

removed my previous comment, as it was for other issue 🤦

it is surely possible to do so. but this would mean we require every user to use and compile regex on their own. my goal here is ease of use, this one is pretty frequent in our pipelines and i see using ExtractPattern with regex as very error prone. having something working and tested (using UT) will surely benefit our users

michalpristas avatar Apr 22 '24 08:04 michalpristas

I would be okay adding this. We can use the url.Parse stdlib function instead of having regular expressions either in a statement or in our code. In general I think the fewer regexes we and our users have, the better.

evan-bradley avatar May 09 '24 18:05 evan-bradley

I would be okay adding this. We can use the url.Parse stdlib function instead of having regular expressions either in a statement or in our code. In general I think the fewer regexes we and our users have, the better.

I see the value this provides to our end-users, but we don't have a good way right now to decide when a new function should be accepted vs an existing function should be used. We really need some criteria for how new functions get added. @michalpristas you've been really active in OTTL recently providing a lot of good feedback and ideas. Any willingness to try and right up those rules? It would really help with several of the function additions you've requested.

TylerHelmuth avatar May 16 '24 19:05 TylerHelmuth

Missing functionality is a strong yes e.g Sorting array, Append, missing converters for SHA512 or MD5.

For other cases I try to assume non technical user, so yes for me is when new function significantly improves UX or provides a more performant way of achieving same result. (Also a bit opinionated, if I have to write regex that is more than 10 characters I don't want to do that)

Example of improved UX is probably #32593 This issue here I think provides better UX and does not use regex so maybe even perf is better (I say maybe because don't have measurement)

If UX is same but assumes less technical user (better naming, terminology...) I'd say it's a maybe. This should be evaluated per use-case and there should be only a few requests/issues like that.

When UX is worse or performance is troublesome, this should be a strong no and there should be no issues like that.

Other way I can think about that is that this should be treated like a code. We should aim for readability first and performance second (meaning performant solution can be achieved by using more low level pieces)

With readability same things as with code should be expected, few important that come to mind

  • short, meaningful and descriptive names
  • naming consistent across functions
  • avoiding deep nesting fn1(target, fn2(fn3(fn4...)))) for frequent use cases
  • single responsibility & DRY

michalpristas avatar May 17 '24 13:05 michalpristas

@michalpristas can you open a PR with these suggestions in ottlfunc README at the bottom so we can continue the discussion and codify the rules.

TylerHelmuth avatar May 17 '24 16:05 TylerHelmuth

I opened a pr. I will be on pto for the next 10 days though so my response time may be slightly unresponsive

michalpristas avatar May 20 '24 07:05 michalpristas

A couple thoughts:

  1. We have a stanza URI parser already. We've been able to share code between OTTL functions and stanza parsers in some other cases so I would like to consider what it looks like to do this here as well.
  2. The stanza parser may have been written before the semantic conventions. If there's a mismatch, I support migrating that code to match semantic conventions with a feature gate to allow users a chance to prepare.
  3. The result proposed in this issue is structured but the semantic conventions are flat. I believe elastic considers these equivalent but OTel does not. (I may be wrong but this is my understanding.) Therefore I believe the correct result would be:
{
  "url.path" : "/foo.gif",
  "url.fragment" : "fragment",
  "url.extension" : "gif",
  "url.password" : "mypassword",
  "url.original" : "http://myusername:[email protected]:80/foo.gif?key1=val1&key2=val2#fragment",
  "url.scheme" : "http",
  "url.port" : 80,
  "url.user_info" : "myusername:mypassword",
  "url.domain" : "www.example.com",
  "url.query" : "key1=val1&key2=val2",
  "url.username" : "myusername"
}

djaglowski avatar May 29 '24 13:05 djaglowski

I've been writing a lot of OTTL lately and I found myself wanting exactly this functionality as a user convenience.

Fully-compliant URL parsing is definitely non-trivial, and IMO worth implementing centrally.

cdanis avatar Jun 12 '24 18:06 cdanis

Closing. Addressed in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32906

michalpristas avatar Jun 17 '24 13:06 michalpristas