azure-webjobs-sdk icon indicating copy to clipboard operation
azure-webjobs-sdk copied to clipboard

Distributed tracing in Azure Functions

Open lmolkova opened this issue 6 years ago • 12 comments

I'll go through the planned changes (split by several PRs) and describe high-level design and expected outcome.

Here is a messy prototype for all changes: https://github.com/lmolkova/azure-webjobs-sdk/commit/22a7affce96f0c34363b3746d0af5797a1c1c593

I'm open to any feedback and suggestions and want to make sure we are on the same page before I start the implementation

Motivation

We want to enable distributed tracing for Azure Functions using ApplicationInsights users are able to track calls to and from the Azure Functions and correlate it to the rest of telemetry reported by other applications.

As a result, we want to enable experiences like End-to-end transaction viewer and Application Map for all users automagically.

Goal 1: supported auto-tracking (triggers only are tracked now)

ApplicationInsights can track HTTP, ServiceBus, EventHub, SQL calls automagically if configured so. It's a component responsibility to participate in distributed tracing, i.e. when new components (e.g. ServiceGrid) implement specific instrumentation, they just light up - requiring no changes from Azure Functions.

Enabling current tracking capabilities gives us following level of support:

[UPDATE] -v1 and all bindings that are not supported in v2 are removed [UPDATE] -added correlation column for clarity

Type Trigger Input Output Calls Tracking E2E Correlation
Blob Storage ✔ (Http) :x: Function-to-function only
Cosmos DB ✔ (Http - Azure Functions use default configuration, i.e. all communication is Http based) N/A
Event Grid N/A (trigger only) N/A
Event Hubs ✔ in single dispatch sceanrios, TBD in multiple dispatch scenarios
Http ✔ W3C and Request-Id standards
Microsoft Graph Excel tables ✔ (Http) N/A
Microsoft Graph OneDrive files ✔ (Http) N/A
Microsoft Graph Outlook email ✔ (Http) N/A
Microsoft Graph Events ✔ (Http) :x:
Microsoft Graph Auth tokens ✔ (Http) N/A
Mobile Apps ✔ (Http) N/A
Notification Hubs ✔ (Http) N/A
Queue storage ✔ (Http) :x: Function-to-function only
SendGrid :x: :x:
Service Bus
Table storage ✔ (Http) N/A
Timer N/A (trigger only) N/A
Twilio ✔ (Http) N/A

Goal 2: Correlation.

  • Within the function: function call should be correlated to any tracked binding call

  • Beyond the function: if trigger contains some context for correlation (specific http header, ServiceBus message property, blob metadata, etc), we want to use it to correlate function telemetry to the rest. The context should also flow to the output.

Correlation is achieved using .NET System.Diagnostics.Activity that holds all tracing context. Anyone can access static Activity.Current to get current context.

Each component is responsible to start a new Activity and propagate the context to downstream service (Azure Storage does not support it though). I.e. ServiceBus knows how to serialize Activity into the message.

Component is also responsible for extracting Activity from the wire. I.e. ServiceBus knows how to extract Activity from the messsage.

So, we will leverage this in Functions.

The Plan

PR1: Dependency Injection (optional, but highly desirable)

Motivation

Now, configuration of applicationInsights is done via DefaultTelemetryClientFactory.

With enabling dependency collection and correaltion this is going to grow and become complicated.

It's also hard to tune for users.

Another issue is that if AppInsights enables new features, Azure Functions would need to change the code to enable them as well.

Proposal

Leverage Dependency Injection to instantiate and control lifetime of AppInsights instances. Hide it beyond IHostBuilder.ConfigureApplicationInsights (Configure/Add/Use) extension. Eventually, ApplicationInsights should provide such extension for non-Web Hosts and maintain it (now it works in AspNetCore SDK for web hosts).

PR2: Tracking dependency calls

Proposal

  • Enable DependencyTrackingTelemteryModule and enable it to track HTTP, ServiceBus, EventHub

  • Bump ServiceBus and EventHub to the stable versions that support tracing

  • Enable dependency calls filtering based on log category (user code, bindings, triggers, host):

  • [UPDATE] Enable W3C correaltion protocol for outgoing HTTP requests

As a result, users should be able to see all supported calls, and control verbosity for each category separately.

Host logs would not have any scope and would not be logged for now (which is easy to change).

Open questions:

  • dependency calls for triggers

  • store Activity.Id instead of guid in blob metadata: breaking change and have to be done carefully or not done at all

PR3: Enable auto-requests collection (function executions) with parent context

Motivation

ServiceBus (EventHubs in future) processing are instrumented already as well as HTTP Asp.NET Core hosting. I.e. ApplicaitonInsights could track such calls automagically.

Since each component knows how to extract trace context and set it on the Activity, we want to leverage it.

So, there are 3 types of triggers

  • auto-trackable: ServiceBus, EventHub (in future), Http (in future, more on this later)

  • non auto-trackable, but can extract parent context: EventHub (now), Blob, Azure Storage Queue

  • non trackable and no context: Timer

Proposal

  • we are going to enable automatic request collection

  • for triggers that could have parent context (EH, blobs, queues), extract Activity and store it in IFunctionInstance.

  • Obsolete ParentId and add ParentActivity on IFunctionInstance and other places where applicable.

  • when request tracking starts, based on the Activity.Current presence and ParentActivity presence in the scope, decide whether it's needed to track request and how to do it

  • move properties assignment to telemetry initializers, so that auto-tracked requests have all functions tags (name, invocation id, trigger reason, etc)

Open questions:

  • function serialization into the persistent queue (whether Activity have to be serialized)

  • trigger can have parent context AND AzureFunctions ParentId altogether - both should appear on the request telemetry

  • multiple EH messages/Blobs in a trigger: how to correlate

PR4: HTTP extension and context extraction

Motivation

HTTP request tracking could be automagical, but unfortunately, we cannot configure the listener for it now - it's an internal ApplicationInsights thing - we will expose it. But we still want to get tracing context from the incoming requests and stamp it on the telemetry.

Proposal

  • In the HTTP extension, parse headers of the incoming request.

  • Support W3C and Request-Id protocols

  • Set parsed context on IFunctionInstance.ParentActivity

  • [if possible in this iteration] Support multiple instrumentation keys scenarios (AppInsights sets some headers to possible to correlate telemetry across different apps with different ikeys)


Future work outside of Azure functions: will be done later at some point (not blocking current effort)

  1. Splitting ApplicationInsights AspNetCore SDK into several packages:
  • one that supports generic DI and logging, no HTTP/Web
  • another one that adds Web and ASP.NET Core support
  • make multi-ikey and http listener configuration public
  1. Efficient filtering for dependency telemetry (ApplicationInsights SDK)
  2. ~~EventHub Processor instrumentation~~ [EDITED - does not make sense]

lmolkova avatar Jun 01 '18 00:06 lmolkova

@brettsam @paulbatum @fabiocav please review the plan and provide your comments. /cc @SergeyKanzhelev @zakimaksyutov

lmolkova avatar Jun 01 '18 00:06 lmolkova

Great writeup! Here's some comments:

Overall

If we're not going to backport this to v1.x, we should remove all those checkmarks so it's clear we're only doing this for v2.x.

PR1

I'm all for doing things the right way. Let's try to make this work. Some custom stuff we do today that I'd think we want to include:

  • Update the SDK version so it's trackable: client.Context.GetInternalContext().SdkVersion = $"webjobs: {assemblyVersion}";
  • Add our custom initializers.
  • Add our custom "post-QuickPulse" filtering. This was done to allow QuickPulse to show invocations/errors/etc, even if all logging was disabled or filtered.

PR2

looks good -- we can discuss the open questions separately

PR3

Some questions:

  • So these client libraries (EH, SB, etc) are responsible for setting Activity.Current when the message, etc, is received? Or is that something that WebJobs (or AI) will need to do?
  • What do you mean by 'function serialization into the persistent queue"?
  • I think we can start ignoring AzureFunctions ParentId in favor of this new parent. We don't expose it in Functions today, although WebJobs does.

PR4

Is there something that the HttpTrigger can do to allow this happen automatically? Which part is preventing that?

brettsam avatar Jun 01 '18 16:06 brettsam

@brettsam Thanks!

I updated the table to reflect v2 only.

PR1

I agree all current stuff (initializers, filters, SDK version) should be added into the DI and participate in TeletmeryConfiguration creation. Later, when AI SDK provides generic DI config, we'll still be able to adjust this config and add custom things

PR3

So these client libraries (EH, SB, etc) are responsible for setting Activity.Current when the message, etc, is received? Or is that something that WebJobs (or AI) will need to do?

Yes, SB and HTTP (and eventually EH) can extract Activity from the message and make it current.

SB do this already today - nothing is expected from WebJobs

EH can only extract, but not assign current, for EH WebJobs part would look like

var message = value.GetSingleEventTriggerInput(i);
var activity = message.Events.SingleOrDefault()?.ExtractActivity();
TriggeredFunctionData input = new TriggeredFunctionData
{
    ParentActivity = activity,
    TriggerValue = message
};
Task task = _parent._executor.TryExecuteAsync(input, _cts.Token);

ApplicaitonInsights logger would do following (instead of RequestTelemtery creation)

if (parentActivity != null && parentActivity != Activity.Current)
{
    stateValues[OperationContext] = _telemetryClient.StartOperation<RequestTelemetry>(parentActivity);
}

For Http, this functionality is enabled by specific listener we cannot use right now (see PR4 below). So in HttpTrigger we'd have similar code to extract Activity from HttpHeaders and pass it to the executor and logger.

Blob/queue trigger code would be similar to EH, except we'll assugn parent explicitly

Guid? parentId = await _causalityReader.GetWriterAsync(blob, cancellationToken);
Activity activity = new Activity("BlobQueueTriggerExecutor");
if (parentId != null)
    activity.SetParentId(parentId.ToString());

TriggeredFunctionData input = new TriggeredFunctionData
{
    ParentActivity = activity,
    TriggerValue = blob
};

return await registration.Executor.TryExecuteAsync(input, cancellationToken);

What do you mean by 'function serialization into the persistent queue"?

Using Activity instead of ParentId leads to changes in FunctionStartedMessage, FunctionStartedMessage, InstanceTableEntity. which are serialized into Azure Table and Queue. Is it correct it is used for logging only? I also expect it is somehow serialized and propagated to non .NET functions? I have not think about it yet, but we'll need to ensure correlation would work there as well.

PR4

Is there something that the HttpTrigger can do to allow this happen automatically? Which part is preventing that?

Instrumentation is only available in presence of the listener. ApplicationInsights ASP.NET Core SDK implements this listener, and it can be configured from the WebJobs (I was mistakenly thinking that is it not). But:

  1. it would require WebJobs.Logging.ApplicationInsights to depend on ApplicationInsights.AspNetCore SDK (that brings Microsoft.AspNetCore.Hosting and few more dependencies). Is it a problem?
  2. The instrumentation exists in Microsoft.AspNetCore.Hosting and it is not explicitly referenced. I can still see that Hosting is used with HttpTrigger, but I wonder how it is choosen and whether we can rely on this to be available. Maybe you can cast more light on how http is actually hosted?
  3. Asp.Net Core does not support W3C protocol yet (we are discussing it now, and may have support for it in 2.2). So the ASP.NET Core instrumentation would not solve everything and we'd need to support W3C on Trigger level now anyway

So all these concerns are solvable, but it seems re-using this auto-collection now may do more harm then good. I'd prefer if we configured it after AspNEt Core 2.2 is released and AppInsights AspNetCore SDK is split into seveal packages (so no unwanted dependencies introduced for WebJobs).

Assuming this, in HttpTrigger, we'd just parse headers and set the on the Activity and pass it to the executor. Executor will pass it to AppInsights and it knows what to do with the Activity.

lmolkova avatar Jun 01 '18 21:06 lmolkova

Note -- we moved this to Active Questions so it falls off of our "Untriaged" query. It's more of a "spec" and not really a direct issue.

brettsam avatar Jul 05 '18 18:07 brettsam

Sorry, but I don't really understand the status of things here - the request-id from an incoming http request is still not picked up and tracked, right? Any idea when this might be released?

riha avatar Apr 15 '19 13:04 riha

@lmolkova -- can you update this with the current status?

brettsam avatar May 30 '19 17:05 brettsam

Do plans exists to add this functionality to OrchestrationTriggers? Right now an HttpTrigger picks up the Request-Id and sets up the Activity correctly. However, when subsequently calling an OrchestrationTrigger from a DurableOrchestrationClient the Activity.Current is a new, unrelated Activity in the OrchestrationTrigger function.

brhinescot avatar Jul 18 '19 17:07 brhinescot

Are there any plans to make Azure Storage Queue auto trackable?

I've tried out a few things to enable end-to-end correlation when Azure Queues are involved, but couldn't find a way to make it work. Even if I manually serialize a ParentID as part of my Queue payload, I can't find a way to use it to set the Request telemetry ParentID for the Function that consumes the Queue message.

I guess what I'm asking is if there are plans to:

  1. Automagically enable end-to-end tracking for Azure Queue messages across different systems
  2. Allow the ability to set the ParentID manually in the Function telemetry, so I can serialize a ParentID in the queue payload and set it myself.

Thanks!

marknblueswitch avatar Aug 05 '19 22:08 marknblueswitch

Seconding @marknblueswitch

I guess what I'm asking is if there are plans to:

1. Automagically enable end-to-end tracking for Azure Queue messages across different systems

2. Allow the ability to set the ParentID manually in the Function telemetry, so I can serialize a ParentID in the queue payload and set it myself.

This is something that would be extremely useful - we are looking to use Storage Queues over Service Bus due to the simplicity (Service Bus has a lot of extra features we just don't need, and doesn't have any local development/emulator support) but the lack of distributed tracing is a real bummer. It would be really nice to provide some sort of mechanism to allow setting the ParentID from a property in the JSON serialized message body.

Are there any plans to support something like this or another mechanism of distributed E2E tracing with storage queues?

ssa3512 avatar Mar 15 '21 00:03 ssa3512

@fabiocav @lmolkova is distributed tracing supposed to currently work for function to function queue messages? I ran across QueueCausalityManager which sets the $AzureWebJobsParentId value in the JSON being sent into the queue, but I don't see the app insights request associated with the second function execution being correlated. I was trying to see if adding $AzureWebJobsParentId with the operation ID of an external operation would also cause correlation, but it doesn't even appear to currently happen for function to function messages in the same app.

ssa3512 avatar Mar 15 '21 02:03 ssa3512

In queue storage & blob storage - Function-to-function only E2E logging means what? if it offer distributed log tracing and how to achieve it ?

I created this repository to reproduce this issue.

https://github.com/dlebee/corrolation-function-servicebus

More and more library are trying to implement OpenTelemetry in C# which heavily depends on higher version of System.Diagnostics.DiagnosticSource

This is causing azure function service bus trigger correlation issues.


The following is what I think is causing the issue

<= 4.6.0

public partial class Activity
{
    public static Activity Current
    {
#if ALLOW_PARTIALLY_TRUSTED_CALLERS
[System.Security.SecuritySafeCriticalAttribute]
#endif
        get
        {
            ObjectHandle activityHandle = (ObjectHandle)CallContext.LogicalGetData(FieldKey);
            // Unwrap the Activity if it was set in the same AppDomain (as FieldKey is AppDomain-specific).
            if (activityHandle != null)
            {
                return (Activity)activityHandle.Unwrap();
            }

            return null;
        }
    }
}

VS

Since 4.7.1 >= or around there

public partial class Activity
{
    private static readonly AsyncLocal<Activity> s_current = new AsyncLocal<Activity>();

    /// <summary>
    /// Gets or sets the current operation (Activity) for the current thread.  This flows 
    /// across async calls.
    /// </summary>
    public static Activity Current
    {
        get { return s_current.Value; }
        set
        {
            if (ValidateSetCurrent(value))
            {
                SetCurrent(value);
            }
        }
    }
}

dlebee avatar Feb 25 '22 22:02 dlebee