bpftime
[FEATURE] High-performance eBPF helper function implementation on the GPU side
Is your feature request related to a problem? Please describe.
It seems the current implementation of the eBPF helper functions on the GPU side has some performance issues.
I have read the source code for this part. Essentially, the helper functions work by having the GPU read host-side memory through the request and response structures in CommSharedMem.
Although cudaHostRegisterMapped is used to map the shared memory, the GPU does not actually read the CPU's memory directly: the data still travels over the underlying PCIe channel, and the driver layer merely performs this copy implicitly.
This causes a significant performance loss, and my test results lead to the same conclusion: https://github.com/eunomia-bpf/bpftime/issues/411
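To make the cost concrete, here is a minimal micro-benchmark sketch (not from the bpftime sources; the buffer size, kernel, and summing workload are my own assumptions) that contrasts GPU reads from cudaHostRegisterMapped host memory, which cross PCIe on every access, with reads from GPU global memory:

```cuda
// Hypothetical micro-benchmark, for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_kernel(const int *src, long long *out, int n) {
    long long acc = 0;
    // A single thread walks the buffer so every read is an individual access.
    for (int i = 0; i < n; ++i) acc += src[i];
    *out = acc;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapped host memory
    const int n = 1 << 16;                  // small buffer keeps the demo quick
    size_t bytes = n * sizeof(int);

    // Pinned, mapped host buffer: the GPU can dereference it directly,
    // but each access still traverses the PCIe bus.
    int *host_buf = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) host_buf[i] = 1;
    cudaHostRegister(host_buf, bytes, cudaHostRegisterMapped);
    int *host_buf_dev;
    cudaHostGetDevicePointer((void **)&host_buf_dev, host_buf, 0);

    // The same data resident in GPU global memory.
    int *dev_buf;
    cudaMalloc(&dev_buf, bytes);
    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);

    long long *out;
    cudaMalloc(&out, sizeof(long long));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    sum_kernel<<<1, 1>>>(host_buf_dev, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("mapped host memory: %.3f ms\n", ms);

    cudaEventRecord(start);
    sum_kernel<<<1, 1>>>(dev_buf, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU global memory:  %.3f ms\n", ms);

    cudaHostUnregister(host_buf);
    free(host_buf);
    cudaFree(dev_buf);
    cudaFree(out);
    return 0;
}
```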
Describe the solution you'd like
In the GPU scenario, could the map data be stored in GPU global memory, which is well suited for data the GPU accesses frequently, and only copied to the host side when the host actually needs to read it? (A rough sketch of this idea is included under Additional context below.)
Describe alternatives you've considered
Provide usage examples
Additional context
If this direction is feasible, I would be happy to invest some research and development effort in this issue.
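Here is a minimal sketch of the direction I have in mind; it is not bpftime's actual API, and the struct layout, capacity, and helper names are assumptions for illustration. The map's backing store lives in GPU global memory, so GPU-side helper calls touch only device memory, and the host pulls a snapshot with an explicit cudaMemcpy only when it actually needs to read:

```cuda
// Hypothetical GPU-resident map, for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int MAP_CAPACITY = 256;  // assumed fixed-size array map

struct GpuArrayMap {
    long long values[MAP_CAPACITY];
};

// Device-side "helper": updates the map entirely in GPU global memory,
// with no round trip to the host.
__global__ void map_update_kernel(GpuArrayMap *map, int key, long long delta) {
    if (key >= 0 && key < MAP_CAPACITY)
        atomicAdd((unsigned long long *)&map->values[key],
                  (unsigned long long)delta);
}

// Host-side read path: copy the map back only when the host asks for it.
void map_snapshot_to_host(const GpuArrayMap *dev_map, GpuArrayMap *host_out) {
    cudaMemcpy(host_out, dev_map, sizeof(GpuArrayMap), cudaMemcpyDeviceToHost);
}

int main() {
    GpuArrayMap *dev_map;
    cudaMalloc(&dev_map, sizeof(GpuArrayMap));
    cudaMemset(dev_map, 0, sizeof(GpuArrayMap));

    // Many GPU threads update the map; all traffic stays on-device.
    map_update_kernel<<<32, 256>>>(dev_map, 7, 1);
    cudaDeviceSynchronize();

    GpuArrayMap host_copy;
    map_snapshot_to_host(dev_map, &host_copy);
    printf("values[7] = %lld\n", host_copy.values[7]);  // 32 * 256 = 8192

    cudaFree(dev_map);
    return 0;
}
```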
Yes, I think so! We are working on fixing this in our next version; it is a high priority.
We already have some GPU-specific maps and helpers, for example in
https://github.com/eunomia-bpf/bpftime/tree/master/runtime/src/bpf_map/gpu
These maps are per GPU thread, lock-free, and stored in GPU memory.
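A hedged sketch of that per-thread, lock-free idea (the names and layout here are mine, not the code under runtime/src/bpf_map/gpu): each GPU thread owns a private slot keyed by its global thread index, so updates need neither locks nor atomics, and the data never leaves GPU global memory on the hot path.

```cuda
// Hypothetical per-thread map, for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int THREADS = 4 * 256;  // grid size fixed for the illustration

__global__ void per_thread_count_kernel(long long *per_thread_slots) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Exclusive ownership of slot `tid`: a plain read-modify-write is race-free.
    per_thread_slots[tid] += 1;
}

int main() {
    long long *slots;
    cudaMalloc(&slots, THREADS * sizeof(long long));
    cudaMemset(slots, 0, THREADS * sizeof(long long));

    per_thread_count_kernel<<<4, 256>>>(slots);
    cudaDeviceSynchronize();

    // The host aggregates only when it needs the totals.
    long long host_slots[THREADS];
    cudaMemcpy(host_slots, slots, sizeof(host_slots), cudaMemcpyDeviceToHost);
    long long total = 0;
    for (int i = 0; i < THREADS; ++i) total += host_slots[i];
    printf("total = %lld\n", total);  // 1024

    cudaFree(slots);
    return 0;
}
```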
You can also find some GPU-specific helpers in https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu
Any other suggestions?