Improve public accessibility to our trace tooling

Open daravep opened this issue 3 years ago • 6 comments
Canton deployments support trace context propagation: https://docs.daml.com/canton/usermanual/monitoring.html#tracing

As a result, every command is associated with a trace-id that is propagated between all distributed components, which simplifies debugging.

In fact, trace contexts can even be generated in applications and propagated through gRPC and the Ledger API server into the Canton synchronization process.
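For illustration only, here is a minimal sketch of generating a trace context on the application side. It assumes the context travels as a W3C Trace Context `traceparent` entry in the gRPC request metadata; this issue does not confirm which header the Ledger API server actually reads, and all helper names below are hypothetical.

```python
import re
import secrets
from typing import List, Tuple

# W3C Trace Context layout: version "00", 16-byte trace-id,
# 8-byte parent-id and 1-byte flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a fresh traceparent value (hypothetical helper)."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    parent_id = secrets.token_hex(8)   # 16 hex characters
    flags = "01" if sampled else "00"  # "01" marks the trace as sampled
    return f"00-{trace_id}-{parent_id}-{flags}"


def trace_id_of(traceparent: str) -> str:
    """Extract the trace-id, e.g. to grep for it in the logs later."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    return match.group(1)


def grpc_metadata(traceparent: str) -> List[Tuple[str, str]]:
    """Metadata as (key, value) pairs; the key name is an assumption."""
    return [("traceparent", traceparent)]
```

With the Python gRPC library, such pairs are normally passed through the `metadata=` argument of a stub call. Whether the receiving service honors them depends on the gaps listed below.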

Also, the Scala ledger bindings, which are not officially supported, support context propagation, as used by Canton internally: https://github.com/digital-asset/canton/blob/main/community/common/src/main/scala/com/digitalasset/canton/ledger/api/client/CommandSubmitterWithRetry.scala#L75

However, there are several gaps:

  • [ ] the trace context is not included in the logs of the Ledger API server
  • [ ] the trace context is only propagated to Canton on the write path. There is no propagation on the read path, i.e. once Canton emits a transaction to the Ledger API server.
  • [ ] #14258
  • [ ] there is no publicly available documentation on how to add a trace context directly to a gRPC call: https://docs.daml.com/app-dev/grpc/index.html#use-the-ledger-api-with-grpc

The current workaround is to find a "Rosetta stone" log message that mentions both the command-id and a trace-id, and then to filter for both. However, that only works for command submissions, not for, e.g., party allocations, and is suboptimal.
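A rough sketch of that workaround, for anyone who needs it in the meantime. The log layout (`trace-id=...` and `command-id=...` tokens on a line) is made up for illustration and will not match real Canton or Ledger API server logs as-is, so the regular expressions need adapting; the function names are hypothetical too.

```python
import re
from typing import Iterable, List, Optional

# Hypothetical log layout: "<level> trace-id=<hex> <message> command-id=<id>"
TRACE_RE = re.compile(r"trace-id=([0-9a-f]+)")


def extract_trace_id(line: str) -> Optional[str]:
    """Return the trace-id found on a log line, if any."""
    match = TRACE_RE.search(line)
    return match.group(1) if match else None


def trace_id_for_command(lines: Iterable[str], command_id: str) -> Optional[str]:
    """Find the 'Rosetta stone' line that carries both identifiers."""
    # The lookahead stops "cmd-1" from also matching "cmd-10".
    command_re = re.compile(rf"command-id={re.escape(command_id)}(?=\s|$)")
    for line in lines:
        if command_re.search(line):
            trace_id = extract_trace_id(line)
            if trace_id is not None:
                return trace_id
    return None


def lines_for_command(lines: Iterable[str], command_id: str) -> List[str]:
    """All log lines that share the trace-id of the given command."""
    lines = list(lines)
    trace_id = trace_id_for_command(lines, command_id)
    if trace_id is None:
        return []
    return [line for line in lines if extract_trace_id(line) == trace_id]


# Made-up sample input, only to show the intended call.
SAMPLE_LOGS = [
    "INFO trace-id=aaa1 Received submission command-id=cmd-1",
    "DEBUG trace-id=aaa1 Sequencing request",
    "INFO trace-id=bbb2 Received submission command-id=cmd-2",
    "INFO Allocated party alice",
]
matching = lines_for_command(SAMPLE_LOGS, "cmd-1")
```

The party allocation line in the sample carries no command-id, so there is nothing to look it up by, which is exactly the limitation described above.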

daravep avatar Jun 23 '22 09:06 daravep

I spun off #14258 for the Application Runtime team. Please keep in mind that we need to prioritize according to our available resources.

stefanobaghino-da avatar Jun 23 '22 12:06 stefanobaghino-da

@stefanobaghino-da Thanks. I opened this based on a request by a client in order to capture the current state, not to suggest immediate action. Please discuss the priority of this with the product manager in charge. Oh, that's you! ;-)

daravep avatar Jun 23 '22 12:06 daravep

@daravep I am working on #12798, where I will forward the trace_id from the JSON API to the Ledger API. As I understand it, there is no documentation at the moment. Could you point me to the code that is related to passing the trace context in gRPC? I would like to understand how to pass in a trace context when interacting with the ledger. Thanks

chunlokling-da avatar Sep 09 '22 10:09 chunlokling-da

@chunlokling-da Thanks for digging into this. Please don't forget to make your conclusions available to others too.

I've left the pointer that I know in the original description of this issue: https://github.com/DACH-NY/canton/blob/main/community/participant/src/main/scala/com/digitalasset/canton/participant/ledger/api/client/CommandSubmitterWithRetry.scala#L75

You can also observe that it happens here: https://github.com/digital-asset/daml/blob/main/ledger/ledger-api-client/src/main/scala/com/digitalasset/ledger/client/services/commands/CommandSubmissionFlow.scala#L25

On the Canton side, @danilofaria did all the work to get this working. He understands it much better, so I would suggest you talk to him (or use git blame to figure out who wrote the command submission flow ...)

daravep avatar Sep 09 '22 12:09 daravep

@daravep Thank you so much for the links!

I had a look at the code here: https://github.com/digital-asset/daml/tree/4064e992f369f7f8e52dfd8803388d7fa56bb6f9/ledger/participant-integration-api/src/main/scala/platform/apiserver/services

I can only find a telemetryContext: TelemetryContext in ApiSubmissionService, but not in the other services (e.g. ApiCommandService.scala). Does that mean that only ApiSubmissionService supports the trace context?

chunlokling-da avatar Sep 12 '22 14:09 chunlokling-da

I think so. @mziolekda might know better how far tracing is supported in the ledger API server.

daravep avatar Sep 13 '22 07:09 daravep

Yes, that is correct, only a bunch of services support it:

  • CommandSubmissionService
  • ConfigManagementService
  • PackageManagementService
  • PartyManagementService

Covering additional services would take a mid-size project. The problem with the entire tracing solution is that it is a bit ad hoc at the moment. Bits of it have been implemented to suit a particular immediate testing need of a particular DA team. It has not been addressed comprehensively from the client perspective, and I feel that would be required here. Only then would we avoid the trap of implementing it partially or omitting some components altogether, for example the JSON API.

mziolekda avatar Sep 27 '22 07:09 mziolekda

Just to confirm, based on your list: do the CommandCompletionService and the CommandService not support tracing?

stefanobaghino-da avatar Oct 03 '22 14:10 stefanobaghino-da

@mziolekda Ping on the comment above since it blocks #12798.

stefanobaghino-da avatar Oct 04 '22 08:10 stefanobaghino-da

@skisel-da confirmed that neither service supports tracing right now. #12798 remains blocked for the time being.

stefanobaghino-da avatar Oct 04 '22 08:10 stefanobaghino-da

@mziolekda How do you prefer to keep track of the work that would be required to expand the tracing capabilities of the Ledger API server? This issue has just been created about it.

stefanobaghino-da avatar Oct 10 '22 13:10 stefanobaghino-da

I have created an epic for it on our backlog (https://digitalasset.atlassian.net/browse/DPP-1275). It is not planned for Q4. If you feel it should be addressed urgently, a priority call must be made by Bernhard and Ratko.

mziolekda avatar Oct 13 '22 12:10 mziolekda

Just to add my two cents here, I am currently improving documentation around tracing in Canton and describing how we trace a Canton ping.

The ping is a series of 3 commands:

  • participant1's party submits a command to create a Ping contract
  • participant2's party listens to the transactions and reacts to the creation of the Ping contract by exercising the Respond consuming choice, which results in the creation of a Pong contract
  • participant1's party listens to the transactions and reacts to the creation of the Pong contract by exercising the Ack consuming choice on it, which finalizes the process.

Ideally, we would like the 3 steps to be part of the same trace. However, even though command submission does support passing a telemetryContext, the transaction object we get from the transaction source carries no telemetry information. As a result, the trace is broken and part of the ping trace is missing.
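The break can be pictured with a toy model. This is not Canton code: every name below is hypothetical, and the only behavior it encodes is the one described above, namely that the write path carries the trace-id while the read path may drop it.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional

_counter = itertools.count(1)


def fresh_trace_id() -> str:
    """Start a new trace (toy stand-in for a real trace-id)."""
    return f"trace-{next(_counter)}"


@dataclass
class Transaction:
    description: str
    trace_id: Optional[str]  # what a listener on the read path gets to see


def submit(description: str, trace_id: str, propagate_on_read: bool) -> Transaction:
    """The write path always has the trace-id; the read path only if propagated."""
    return Transaction(description, trace_id if propagate_on_read else None)


def ping(propagate_on_read: bool) -> List[str]:
    """Return the trace-id under which each of the 3 ping commands runs."""
    used = []

    # Step 1: participant1 creates the Ping contract under a new trace.
    first = fresh_trace_id()
    used.append(first)
    tx = submit("create Ping", first, propagate_on_read)

    # Step 2: participant2 reacts; without a propagated id it must start over.
    second = tx.trace_id or fresh_trace_id()
    used.append(second)
    tx = submit("exercise Respond, create Pong", second, propagate_on_read)

    # Step 3: participant1 reacts to the Pong in the same way.
    third = tx.trace_id or fresh_trace_id()
    used.append(third)
    submit("exercise Ack", third, propagate_on_read)

    return used
```

In this model, `ping(propagate_on_read=False)` yields three distinct trace-ids, which corresponds to the broken trace, while `ping(propagate_on_read=True)` keeps a single trace-id across all three steps.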

I have to add a caveat in the Canton documentation that this is a current limitation.

Although this is not incredibly urgent, it would be great to get it fixed.

FYI @mziolekda @gerolf-da @stefanobaghino-da

danilofaria avatar Oct 18 '22 20:10 danilofaria