
Questions (Meta Issue)

Open innerop opened this issue 5 years ago • 8 comments

Hi Guy,

Really intrigued by this.

I'd like to experiment with it on a couple of machines, each with an NVIDIA GPU and PyTorch compiled to take advantage of the GPU (CUDA 9), plus a master node (if needed) that has a multi-core Intel server CPU.

I expect to have some questions along the way. What's the best way to ask them? Can I use this thread?

innerop avatar Feb 09 '19 20:02 innerop

Hey @innerop,

Sure, I'll be happy to answer any questions you may have!

loudinthecloud avatar Feb 10 '19 12:02 loudinthecloud

Hey @loudinthecloud,

I'm making my way through the referenced papers and learning PyTorch as a framework.

But before going further, I wanted to make sure your approach here would work for an actual P2P network (ignore how the network is set up, as that's an implementation detail; just assume some latency and the possibility of nodes falling out of the network during training).

The instructions state:

"start a local cluster where each node is running on a different cpu core"

Is this assuming all nodes run on the same server, just different cores? Or can it be run on separate servers? Also, how do you enable GPU (one per node) with PyTorch?

Thank you and looking forward..

innerop avatar Feb 12 '19 17:02 innerop

@innerop, yes, this should work in a P2P network; all communication is done over TCP/IP. A connection is created between every two nodes in the cluster. If a connection gets broken, its flow control score is decreased, but it will be retried again in the future with diminishing probability, i.e., nodes can be restarted. After a successful transfer, the flow control score increases (see conn.py).
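
A minimal sketch of that kind of retry/flow-control scoring is shown below. The class name, constants, and update rule are assumptions for illustration only, not dpwa's actual conn.py code:

```python
import random

class PeerScore:
    """Toy flow-control score for one peer connection (illustrative, not dpwa's conn.py)."""

    def __init__(self, min_score=0.1, max_score=1.0, step=0.1):
        self.score = max_score      # start optimistic
        self.min_score = min_score
        self.max_score = max_score
        self.step = step

    def on_success(self):
        # Successful transfer: raise the score (bounded above).
        self.score = min(self.max_score, self.score + self.step)

    def on_failure(self):
        # Broken connection: lower the score (bounded below),
        # so future retries become less likely.
        self.score = max(self.min_score, self.score - self.step)

    def should_retry(self):
        # Retry with probability proportional to the current score.
        return random.random() < self.score
```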

I've tested it on a local machine where a node is bound to each core (by setting the CPU affinity); the result was 100% utilization across all cores.
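
For reference, pinning a process to a specific core on Linux can be done with os.sched_setaffinity; a generic sketch, not dpwa-specific code:

```python
import os

# Pin the current process (pid 0 = self) to CPU core 2 (Linux-only API).
os.sched_setaffinity(0, {2})

# Confirm which cores the process is now allowed to run on.
print(os.sched_getaffinity(0))
```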

The PyTorch adapter supports training on GPU and will convert the tensors to CPU tensors before sending their data over the network; the pytorch-cifar example uses GPUs. All you need to do is use it with a CUDA-enabled model.
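
The GPU-to-CPU conversion step boils down to a pattern like the following (a simplified, generic sketch; the helper name is made up and this is not the adapter's exact code):

```python
import torch

def tensor_to_bytes(t: torch.Tensor) -> bytes:
    # Move the tensor off the GPU (a no-op if it is already on the CPU),
    # then expose its raw data as bytes suitable for a TCP send.
    return t.detach().cpu().numpy().tobytes()

# Example: serialize a CUDA tensor if a GPU is available, otherwise a CPU tensor.
device = "cuda" if torch.cuda.is_available() else "cpu"
payload = tensor_to_bytes(torch.randn(4, 4, device=device))
print(len(payload), "bytes")
```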

loudinthecloud avatar Feb 12 '19 19:02 loudinthecloud

This is great.

I will be trying it out on two edge devices, each having a 10 TFLOPS GPU (with Tensor Cores) and CUDA 9/10, and a cloud server with an 8-core Xeon CPU. However, I was hoping I could allocate work exclusively to the GPU nodes and ignore the server in the cloud (it acts as a VPN gateway to connect the edge nodes). It should be possible to just run it on the edge servers, given your description.

Can’t wait to get to the point where I can test things out. Right now I’m just reading up and looking at the code. It may take me a bit to get to the “let’s wire things up and see” stage. It’s more important for me to understand the fundamentals than to simply use it as a black box.

Regarding sending only CPU tensors, I wonder if they could also be compressed before transmission.

innerop avatar Feb 13 '19 22:02 innerop

Sending tensor data over a VPN connection could be an issue; an extremely slow node may slow down the whole cluster, as other nodes get blocked when communicating with it.

Adding compression is a great idea; maybe it could also be controlled from the configuration file. Feel free to contribute a patch if it works for you. Calling zlib.compress() in the TxThread and zlib.decompress() in the RxThread will do the trick.
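
Concretely, something along these lines (a standalone sketch of the idea; the helper names are made up, and the actual integration points would be dpwa's TxThread/RxThread):

```python
import zlib

def compress_payload(data: bytes, level: int = 6) -> bytes:
    # Compress the serialized tensor bytes before writing them to the socket.
    return zlib.compress(data, level)

def decompress_payload(data: bytes) -> bytes:
    # Inverse operation on the receiving side.
    return zlib.decompress(data)

# Round-trip check on dummy data (1 MiB of zeros compresses extremely well).
raw = bytes(1024 * 1024)
packed = compress_payload(raw)
assert decompress_payload(packed) == raw
print(f"{len(raw)} -> {len(packed)} bytes")
```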

To be frank, the use case I was targeting was using tens of nodes for training, but I don't see a reason why it wouldn't work with just two nodes as well.

You can try the pytorch-cifar test initially (in the examples folder), even on a single node, and measure utilization, convergence, etc. When that works for you, you can start tuning the configuration file and trying various methods. If congestion is an issue, you can reduce the value of the fetch_probability parameter in the configuration file.

Sorry, I haven't "productized" this project yet; it is in an experimental stage. I hoped someone would pick it up, run some large-scale tests, and evaluate how the networks converge compared to other methods. Unfortunately, I didn't have the resources to do that, but it worked in my use case.

loudinthecloud avatar Feb 14 '19 06:02 loudinthecloud

Well, I have set up a Meetup for others to join and contribute their GPUs to such a scaling experiment. I have two NVIDIA Jetson TX2 SBCs (single-board computers), each with a couple of TFLOPS at full precision. I’ve managed to build a kernel-resident, super-efficient VPN with WireGuard and have built dockerized CUDA and the latest PyTorch. The device costs only $600, so it’s conceivable that I’ll be adding more devices. But the point is for others to join and add their devices.

The Jetson TX2 has only 8 GB of RAM, so I added another 8 GB from its 32 GB internal SSD as swap, but that is significantly slower than RAM. So I plan to optimize memory use during training using something similar to OpenAI’s gradient checkpointing (trading computation for memory), which I believe has been added to autograd in PyTorch.
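
For what it’s worth, PyTorch exposes that compute-for-memory trade-off through torch.utils.checkpoint; a minimal, generic sketch (unrelated to dpwa) might look like:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Recomputes its forward pass during backward instead of storing activations."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # checkpoint() discards intermediate activations and recomputes them
        # during the backward pass, trading extra compute for lower memory use.
        return checkpoint(self.block, x)

x = torch.randn(32, 256, requires_grad=True)
loss = CheckpointedBlock()(x).sum()
loss.backward()
```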

But a lot of this stuff is new to me. I’m a software engineer (distributed systems and DevOps). I got the idea to build an open-source system made up of other people’s edge devices (to show people what can be done without using the big cloud providers) and use it to train and serve AI models, and so I found your project and wish to evaluate it.

What are the core features that make it suitable for P2P, besides the one you’ve mentioned already? I realize most of the answer is in the referenced papers, but it would be great to hear it from you in the context of what you’ve managed to do so far.

Thank you, and looking forward to continuing this exploration.

innerop avatar Feb 14 '19 22:02 innerop

Interesting. I think the throughput across regions will be a pain, and I'm not confident dpwa will be suitable without modifications (you'll 100% encounter obstacles that you'll need to overcome).

Also, there is the question of whether the network will converge and how well it will train; this alone requires testing, even before growing the scale and using heterogeneous hardware, which has extremely different latency between nodes. As far as I understand, the papers more or less assume constant latency, though I may be wrong.

I feel you're trying to take it to an advanced level even before the basic level of usage is established. I'm not saying it can't be done, and I'm optimistic in general, but perhaps take it a step at a time.

Have you looked at federated learning? Maybe solutions that implement it are better suited to your use case; unfortunately, solutions in that area are much more complex.

The general idea sounds cool, it'll definitely be valuable.

loudinthecloud avatar Feb 15 '19 14:02 loudinthecloud

That's why I'm testing with 2 nodes of identical hardware.

I did look at tf-encrypted and PySyft (and Syft), but I'm not sure I need the extra privacy features unless I make it about user-generated data, and I'm not sure they have really worked out the P2P angle, with all the issues around nodes randomly dropping off and coming back on, and the latency relative to the cloud.

Serving inference requests at scale is not an issue; I can use TensorRT and other optimizations like PocketFlow. What I'm concerned about is how to scale the training in a P2P fashion. I have some ideas, but I'm in the exploration stage.

innerop avatar Feb 15 '19 20:02 innerop