opentelemetry-java icon indicating copy to clipboard operation
opentelemetry-java copied to clipboard

Improve detection of resource attributes on ECS

Open felixscheinost opened this issue 3 years ago • 5 comments

This improves the detection of resource attributes on ECS by fetching ECS metadata from ECS_CONTAINER_METADATA_URI or ECS_CONTAINER_METADATA_URI_V4.

Previously only CONTAINER_NAME and CONTAINER_ID id were set.

Now we set:

  • CONTAINER_ID
  • CONTAINER_NAME
  • AWS_ECS_CONTAINER_ARN
  • CONTAINER_IMAGE_NAME
  • CONTAINER_IMAGE_TAG
  • aws.ecs.container.image.id
  • AWS_LOG_GROUP_ARNS
  • AWS_LOG_GROUP_NAMES
  • AWS_LOG_STREAM_NAMES
  • AWS_ECS_TASK_ARN
  • AWS_ECS_TASK_FAMILY
  • AWS_ECS_TASK_REVISION

Especially AWS_LOG_GROUP_ARNS is important so that connection of traces to logs works OOTB on X-Ray.

felixscheinost avatar Jun 30 '22 16:06 felixscheinost

Codecov Report

Base: 90.04% // Head: 90.60% // Increases project coverage by +0.56% :tada:

Coverage data is based on head (1fc24b4) compared to base (0860b38). Patch has no changes to coverable lines.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4574      +/-   ##
============================================
+ Coverage     90.04%   90.60%   +0.56%     
+ Complexity     5057     4906     -151     
============================================
  Files           582      567      -15     
  Lines         15580    14728     -852     
  Branches       1495     1414      -81     
============================================
- Hits          14029    13345     -684     
+ Misses         1096      959     -137     
+ Partials        455      424      -31     
Impacted Files Coverage Δ
...java/io/opentelemetry/sdk/trace/data/SpanData.java 75.00% <0.00%> (-25.00%) :arrow_down:
...esting/context/SettableContextStorageProvider.java 37.93% <0.00%> (-24.14%) :arrow_down:
...a/io/opentelemetry/sdk/trace/internal/JcTools.java 28.57% <0.00%> (-4.77%) :arrow_down:
...ernal/aggregator/ExplicitBucketHistogramUtils.java 97.29% <0.00%> (-2.71%) :arrow_down:
.../opentelemetry/sdk/internal/ComponentRegistry.java 90.90% <0.00%> (-2.43%) :arrow_down:
...o/opentelemetry/api/internal/ReadOnlyArrayMap.java 91.11% <0.00%> (-2.23%) :arrow_down:
...opentelemetry/opentracingshim/SpanBuilderShim.java 77.89% <0.00%> (-0.68%) :arrow_down:
...n/java/io/opentelemetry/sdk/logs/LogProcessor.java 85.71% <0.00%> (ø)
.../java/io/opentelemetry/sdk/logs/SdkLogEmitter.java 100.00% <0.00%> (ø)
...va/io/opentelemetry/sdk/logs/NoopLogProcessor.java 100.00% <0.00%> (ø)
... and 160 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Jun 30 '22 16:06 codecov[bot]

Thanks so much for this improvement! I will have someone from AWS take a look at this shortly

willarmiros avatar Jul 06 '22 17:07 willarmiros

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jul 20 '22 17:07 github-actions[bot]

Bump: Please don't close this PR automatically.

felixscheinost avatar Jul 27 '22 07:07 felixscheinost

Thanks so much for this improvement! I will have someone from AWS take a look at this shortly

@willarmiros any update here?

jkwatson avatar Jul 27 '22 13:07 jkwatson

Apologies for the delay here, we will be having someone review this shortly!

willarmiros avatar Aug 09 '22 16:08 willarmiros

@felixscheinost i missed out on this PR and worked on #4670. Your PR is effectively a superset of mine, with the exception of AWS_ECS_LAUNCHTYPE and AWS_LOGS_STREAM_ARNS and a couple small bugs in the generation of the AWS Log Group ARN. I prepared a fix for your PR in https://github.com/lumigo-io/opentelemetry-java/tree/pr-for-felix. @willarmiros is going to have a look at these PRs in the next few days. Would you be interested in pulling into your PR my changes? :-)

mmanciop avatar Aug 09 '22 18:08 mmanciop

@mmanciop Thanks for rebasing your changes on my PR!

Regarding the change in the log group ARN. That dosn't seem right to me. Have you tested this on AWS?

I just changed this line on my branch, to test this change in isolation, and deployed to our test evironment.

// from
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:" + logGroupName);
// to
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:/aws" + logGroupName + ":*");

With this change in place I can no longer see the logs for a trace in CloudWatch.

If I view a trace, for which I previously could view logs, and click on "Resources" for "Cloudwatch logs" I see the following:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws<the name of my log group>:*"
  }
]

I also tried adding another separting / between /aws and logGroupName but that also doesn't work. In CloudWatch I then see:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws/<the name of my log group>:*"
  }
]

But that is also wrong as my log group is just called <the name of my log group> and not /aws/<the name of my log group>.

felixscheinost avatar Aug 12 '22 08:08 felixscheinost

@mmanciop Thanks for rebasing your changes on my PR!

Regarding the change in the log group ARN. That dosn't seem right to me. Have you tested this on AWS?

I just changed this line on my branch, to test this change in isolation, and deployed to our test evironment.

// from
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:" + logGroupName);
// to
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:/aws" + logGroupName + ":*");

With this change in place I can no longer see the logs for a trace in CloudWatch.

If I view a trace, for which I previously could view logs, and click on "Resources" for "Cloudwatch logs" I see the following:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws<the name of my log group>:*"
  }
]

I also tried adding another separting / between /aws and logGroupName but that also doesn't work. In CloudWatch I then see:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws/<the name of my log group>:*"
  }
]

But that is also wrong as my log group is just called <the name of my log group> and not /aws/<the name of my log group>.

I compared the ARNs with the output of aws logs describe-log-groups and aws logs describe-log-streams. At first, when I started investigating the change, I was as surprised as you. @willarmiros do you concur with these ARN values?

mmanciop avatar Aug 12 '22 12:08 mmanciop

@mmanciop When I do describe-log-groups on my log groups I get an ARN without /aws.

Only when the log group name contains /aws, like for lambdas, does the ARN also contain /aws.

felixscheinost avatar Aug 12 '22 14:08 felixscheinost

@mmanciop When I do describe-log-groups on my log groups I get an ARN without /aws.

Only when the log group name contains /aws, like for lambdas, does the ARN also contain /aws.

I tried it on ECS, but this bears more validation. Brb :-)

mmanciop avatar Aug 12 '22 15:08 mmanciop

So, what I see in the ECS console when deploying a simple CDK + Amazon ECS project is this:

awslogs-group | /aws/batch/job
-- | --
awslogs-region | eu-central-1
awslogs-stream-prefix | CrawlerJobDefinition4D5-55fb222b188f18e

The log-group is called /aws/batch/job in CloudWatch and the ARN is arn:aws:logs:eu-central-1:170906415945:log-group:/aws/batch/job:*.

The AWS CLI reports:

% aws logs describe-log-groups | jq '.logGroups[] | select(.logGroupName == "/aws/batch/job")'
{
  "logGroupName": "/aws/batch/job",
  "creationTime": 1658054662260,
  "metricFilterCount": 0,
  "arn": "arn:aws:logs:eu-central-1:170906415945:log-group:/aws/batch/job:*",
  "storedBytes": 141110571
}

So, yes, we should not add programmatically /aws. The :* at the end of the ARN for the log-group, however, seem pretty legit. Do you agree @felixscheinost ?

mmanciop avatar Aug 12 '22 15:08 mmanciop

@felixscheinost I pushed a commit to remove the spurious /aws prefix from the ARNs. Thanks for catching that :-)

mmanciop avatar Aug 12 '22 15:08 mmanciop

The :* at the end of the ARN for the log-group, however, seem pretty legit.

Yeah, I think so. It seems like it does work without it but both the CLI and aws.amazon.com display log groups with that trailing :*.

Okay, I will manually test the other changes from your commit after the weekend and if everything looks fine (which I assume it will), pull them into this PR.

Is it okay for you if I squash your two commits into one and push it on top of this branch?

felixscheinost avatar Aug 12 '22 17:08 felixscheinost

Is it okay for you if I squash your two commits into one and push it on top of this branch?

Yup, I think in the end it is best to squash the entire PR on merge anyhow

mmanciop avatar Aug 12 '22 18:08 mmanciop

Hi @mmanciop, unfortunately even the trailing :* seems to break showing the logs for a trace.

I cherry picked all your changes, deployed and tried showing the logs for a trace but didn't see any. As sometimes the logs take a while to show up I waited and tried again but still no logs.

Then I removed the trailing :* from the log group ARNs, deployed again and now I can see the logs again.

The output for the trace under "Resources" -> "CloudWatch logs" is:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account number>:log-group:<log group name>:*"
  }
]

vs this:

[
  {
    "log_group": "<log group name>",
    "arn": "arn:aws:logs:eu-central-1:<account number>:log-group:<log group name>"
  }
]

So it seems CloudWatch is very peculiar about how the log group ARNs are formatted. Maybe someone from AWS can verify this.

felixscheinost avatar Aug 17 '22 06:08 felixscheinost

I cherry picked all your changes, deployed and tried showing the logs for a trace but didn't see any. As sometimes the logs take a while to show up I waited and tried again but still no logs.

Then I removed the trailing :* from the log group ARNs, deployed again and now I can see the logs again.

@felixscheinost how exactly are you using these ARNs? With which tool? Because the issue could be there too.

Besides, the spec clearly states with the example that the :* must be there: https://opentelemetry.io/docs/reference/specification/resource/semantic_conventions/cloud_provider/aws/logs/

mmanciop avatar Aug 17 '22 09:08 mmanciop

I am using CloudWatch traces on the official AWS web console.

felixscheinost avatar Aug 17 '22 10:08 felixscheinost

Besides, the spec clearly states with the example that the :* must be there: https://opentelemetry.io/docs/reference/specification/resource/semantic_conventions/cloud_provider/aws/logs/

These are merely "examples".

This links also references the official AWS documentation, which states:

Both of the following are used. The second one, with the * at the end, is what is returned by the describe-log-groups CLI command and the DescribeLogGroups API. arn:aws:logs:region:account-id:log-group:log_group_name arn:aws:logs:region:account-id:log-group:log_group_name:*

I think in the end what works is what's correct. 3rd party tools which parse aws.log.group.arns can be adapted. AWS CloudWatch can hardly be adapted.

felixscheinost avatar Aug 17 '22 10:08 felixscheinost

I suppose @willarmiros, @Aneurysm9 should be making the call and, if :* has to go, update the examples in the spec. However, considering that in CloudWatch actually is showing even in the console ARNs with the :*, I think it is a bug that X-Ray should fix.

Screenshot 2022-08-17 at 13 20

mmanciop avatar Aug 17 '22 11:08 mmanciop

Sorry for being late here: so from my reading there were 2 issues:

  1. /aws was being pre-pended to log groups in the resource detector, but that was fixed to no longer do that (which is the correct fix)
  2. There is a question of whether to include the trailing :* in the log group ARN when recording it on the trace. It seems like the expected behavior is "it doesn't matter" because both such ARN formats are valid, however in practice only the ARN without the trailing :* works. I have just found the bug on our end here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/8a821bb7829a23924edd2e23b22c6751277ab2ea/exporter/awsxrayexporter/internal/translator/aws.go#L272-L279

I will submit a fix for the bug so that both ARN formats work. The spec should stay as-is because the examples are still valid formats, just our parsing code is wrong. However, for now it would be best if the resource detectors could record the ARNs without the trailing :* so that everything still works while we wait for the fix to land.

willarmiros avatar Aug 19 '22 15:08 willarmiros

Hi @mmanciop - I have opened the PR to address the :* issue on the X-Ray exporter side, and @jj22ee has left a few comments but approved. Would you mind addressing the comments so we can then merge?

willarmiros avatar Aug 29 '22 23:08 willarmiros

@willarmiros I cannot push to @felixscheinost's branch (I have to push rights for his fork, which he used to open this PR), but I prepared the commit on https://github.com/lumigo-io/opentelemetry-java/tree/pr-for-felix .

mmanciop avatar Aug 30 '22 07:08 mmanciop

I addressed @jj22ee's comments and cherry-picked @mmanciop's additions.

felixscheinost avatar Aug 30 '22 07:08 felixscheinost

@willarmiros

However, for now it would be best if the resource detectors could record the ARNs without the trailing :* so that everything still works while we wait for the fix to land.

I pushed another commit which removes the trailing :* for now.

felixscheinost avatar Aug 30 '22 07:08 felixscheinost

@willarmiros when you think this is ready, please approve and we can move forward. Thanks!

jkwatson avatar Sep 05 '22 16:09 jkwatson