nydus performance tracing
We need a mechanism to collect the time and resource consumption of nydus at each stage of build time and run time to give us insight into the performance impact of development iterations, identify performance bottlenecks, etc. The main requirements are as follows.
- The ability to count the time consumed by nydus-image/nydusd at each stage in span format.
- Statistics on memory usage, CPU, and network requests.
- The ability to present performance data in report form and compare it with the previous result.
Note that tracepoints should be as non-invasive to the code as possible and should add little performance overhead; otherwise, there should be a switch to enable or disable tracing.
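The requirements above (per-stage timing in span form, behind a switch) can be illustrated with a minimal sketch that uses only the Rust standard library. It is not taken from nydus or from any of the libraries listed in this issue: the names `Span`, `set_tracing`, `take_spans`, and `traced_sum` are hypothetical and exist only for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Global switch (hypothetical): while false, spans record nothing.
static TRACE_ENABLED: AtomicBool = AtomicBool::new(false);

/// Finished spans, stored as (stage name, elapsed time).
static SPANS: Mutex<Vec<(&'static str, Duration)>> = Mutex::new(Vec::new());

/// Turn span collection on or off at run time.
pub fn set_tracing(enabled: bool) {
    TRACE_ENABLED.store(enabled, Ordering::Relaxed);
}

/// RAII guard: the span starts in `enter` and ends when the guard is dropped.
pub struct Span {
    name: &'static str,
    start: Option<Instant>,
}

impl Span {
    pub fn enter(name: &'static str) -> Self {
        // Read the clock only when tracing is on, so a disabled
        // tracepoint costs a single atomic load.
        let start = if TRACE_ENABLED.load(Ordering::Relaxed) {
            Some(Instant::now())
        } else {
            None
        };
        Span { name, start }
    }
}

impl Drop for Span {
    fn drop(&mut self) {
        if let Some(start) = self.start {
            SPANS.lock().unwrap().push((self.name, start.elapsed()));
        }
    }
}

/// Drain all recorded spans, for example to build a report.
pub fn take_spans() -> Vec<(&'static str, Duration)> {
    std::mem::take(&mut *SPANS.lock().unwrap())
}

/// Hypothetical "stage": one added line is enough to trace it.
pub fn traced_sum(n: u64) -> u64 {
    let _span = Span::enter("sum");
    (0..n).sum()
}
```

Calling `set_tracing(true)`, running the stage, and then iterating over `take_spans()` yields one `(name, duration)` pair per traced stage. A real implementation would more likely build on one of the libraries referenced in this issue, which also handle nested spans and exporting.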
Referenceable tools or libraries:
- https://github.com/tikv/minitrace-rust
- https://opentelemetry.io/
- https://www.jaegertracing.io/
- https://developer.aliyun.com/article/799040
Is there anyone working on this? Maybe I can help with it. @imeoer
Hi Qi, I'm working on it. But we can collaborate.
@yawqi @wraymo Thanks! We can split it into two sub-tasks: "trace collection" and "cpu/memory/network requests statistics". Which one do you prefer?
I'd prefer "trace collection" @imeoer
OK, I will look into the "cpu/memory/network requests statistics".
@wraymo @yawqi Thanks! Any specific design doc can be committed here first and discussed together.
Hello 👋 I haven't seen any of the tracing bits land yet. Is there a plan to use something like OpenTelemetry for this? Do you need anything? I am happy to help out on this, as we could really use better insight into these bits, especially:
The ability to count the time consumed by nydus-image/nydusd at each stage in span format.
@lilic Are you talking about network performance tracing? You can try the nydusctl metrics backend --sock /nydusd.sock command to inspect the number of HTTP requests made by nydusd, as well as metrics such as the read time distribution.
@imeoer thank you! That is good to know! But we need a way to send tracing data so we can see performance over time, so OpenTelemetry tracing, for example, or more metrics would be useful for us.
@lilic nydusd exports some metrics through an API, which is also the data source of nydusctl: https://github.com/dragonflyoss/image-service/blob/f3cdd071b01ea5d2086e376a7b2bfee3ee233360/api/openapi/nydus-api-v1.yaml#L241
But indeed, it would be better to support open tracing; unfortunately, no one is working on that yet. :(
@lilic Here is an example of getting the backend metrics:
https://github.com/dragonflyoss/image-service/blob/f3cdd071b01ea5d2086e376a7b2bfee3ee233360/smoke/tests/tool/nydusd.go#L384C18-L384C18
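For readers who want to query the API socket directly rather than through nydusctl, the following is a rough sketch of a plain HTTP GET over a Unix domain socket using only the Rust standard library. The request path in the usage note is an assumption and should be checked against the OpenAPI file linked above; the helpers `build_request`, `response_body`, and `fetch` are hypothetical names, not part of nydus.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Build a minimal HTTP/1.1 GET request for `path`.
pub fn build_request(path: &str) -> String {
    format!("GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
}

/// Return everything after the header block of a raw HTTP response.
/// Note: this does not decode chunked transfer encoding.
pub fn response_body(raw: &str) -> Option<&str> {
    raw.split_once("\r\n\r\n").map(|(_, body)| body)
}

/// Send the request over the Unix socket at `sock` and return the body.
pub fn fetch(sock: &str, path: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(sock)?;
    stream.write_all(build_request(path).as_bytes())?;
    let mut raw = String::new();
    // `Connection: close` asks the server to close the socket when done,
    // so reading until EOF collects the whole response.
    stream.read_to_string(&mut raw)?;
    Ok(response_body(&raw).unwrap_or("").to_string())
}
```

A call might look like `fetch("/nydusd.sock", "/api/v1/metrics/backend")`, where the path is an assumed value. Printing the raw body before parsing it can also help when a client reports a deserialization error.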
@imeoer I see. I thought these were Prometheus metrics, so I was confused why I couldn't find them at the /metrics endpoint. Thanks for that. Sadly, that doesn't work for me; I'm still looking into it, but I get the following error when trying it:
~ # nydusctl --sock=/run/containerd-nydus/system.sock --raw metrics backend
Error: deserialize: trailing characters at line 1 column 5
We can move this to Slack so I don't spam this issue.
I did want to look into adding tracing to at least our fork and, if it works for us, contributing it here.