scuda
Is there a feasible solution for one-to-many remote access?
The current project seems to only support one-to-one connections between the client and the server. If we want a single client to access resources from multiple GPU hosts, is there a feasible solution?
Yes, this is feasible and it's on the roadmap; however, there's a good amount of work to do to make it happen. It's a great idea, though, and I'm excited to see it!
I'm glad to receive this reply and I'm looking forward to this project. Could you briefly explain the principles of the solution?
The general idea is to maintain a map of client device number -> (host, host device number) on the client and, before any operation, convert the client device number to the corresponding host's device number. With this mapping, the client can behave as if all the remote devices were on the same host.
Some initial groundwork is already laid: if you look at the RPC API right now, we always have an index, but at all call sites it is hardcoded to zero. It is intended to be the device number in the future, so the APIs are designed for this use case; however, none of that wiring is done yet.
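To make the mapping idea concrete, here is a minimal sketch of what such a client-side table might look like. This is not code from the repository: `RemoteDevice`, `DeviceMap`, `add_host`, and `resolve` are hypothetical names chosen for illustration, and the flat numbering (devices appended in host registration order) is just one assumed policy.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not taken from the scuda codebase.
struct RemoteDevice {
    std::string host;  // address of the GPU server
    int host_device;   // device number as seen on that server
};

class DeviceMap {
public:
    // Register a host exposing `device_count` GPUs. Its devices are appended
    // to the client's flat numbering in registration order (an assumed policy).
    void add_host(const std::string& host, int device_count) {
        assert(device_count >= 0);
        for (int i = 0; i < device_count; ++i) {
            devices_.push_back(RemoteDevice{host, i});
        }
    }

    // Total number of devices the client would report to the application.
    int device_count() const { return static_cast<int>(devices_.size()); }

    // Translate a client device number into (host, host device number).
    // Returns std::nullopt when the client device number is out of range.
    std::optional<RemoteDevice> resolve(int client_device) const {
        if (client_device < 0 || client_device >= device_count()) {
            return std::nullopt;
        }
        return devices_[client_device];
    }

private:
    std::vector<RemoteDevice> devices_;  // index == client device number
};
```

With two hosts registered, say one with two GPUs and one with a single GPU, client device 2 would resolve to device 0 on the second host. The resolved host would pick the connection, and the resolved device number would presumably take the place of the currently hardcoded zero index in each RPC call.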
I see that the cudnn, cublas, and cudart libs are all hijacked in the code. As I understand it, these requests eventually reach libcuda.so. If we hijacked only the interfaces of libcuda.so, could we still achieve remote access? @kevmo314