
Proposal: use distributed network for training

Redict opened this issue 2 years ago • 9 comments

The amount of data in the dataset is growing at an incredible rate, so I suggest leveraging the power of the community: for example, by installing a special node on each user's computer in a Docker container, which would allow Open Assistant developers to dispatch training tasks to these nodes.

Benefits:

  • Potentially large number of nodes
  • High efficiency
  • Low cost for developers to rent servers

Cons:

  • Possible data tampering if the container is modified
  • Potential vulnerability if the master node is compromised

Feel free to discuss

Redict avatar Mar 03 '23 10:03 Redict

The problem with using distributed networks for training large models is the massive bandwidth needed to transfer gradients. Models would likely have billions of parameters, and sending this data back and forth between nodes and a central server every training step would make training infeasible due to the latency it would create.
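To put a rough number on the bandwidth concern, here is a back-of-the-envelope sketch. The model size (7B parameters) and fp32 gradients are illustrative assumptions, not figures from the thread:

```python
# Rough estimate of per-step gradient traffic for a hypothetical
# 7B-parameter model trained with full fp32 gradients.
def gradient_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to ship one full set of gradients."""
    return num_params * bytes_per_value

params = 7_000_000_000  # assumed model size
traffic = gradient_bytes(params)
print(f"{traffic / 1e9:.0f} GB per node per training step")  # → 28 GB per node per training step
```

Every node would need to upload (and download) on this order of magnitude each step, which is what makes naive gradient exchange over consumer connections impractical.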

someone13574 avatar Mar 03 '23 13:03 someone13574

@someone13574 In what form would these gradients be transmitted? What data sizes are we talking about? Is it possible to represent them in binary form and compress them, for example with zstd? I don't think sending huge amounts of data over the network is actually a problem. It's cheaper to buy more bandwidth than to rent a few more GPUs.
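One caveat with the compression idea: raw gradient values are close to high-entropy noise, so generic compressors gain little. A quick sketch using synthetic Gaussian values as stand-ins for real gradients (zlib here as a stdlib stand-in for zstd):

```python
import random
import struct
import zlib

# Synthetic stand-ins for fp32 gradient values; real gradients are
# similarly noisy, so generic compressors barely shrink them.
random.seed(0)
grads = [random.gauss(0.0, 0.01) for _ in range(100_000)]
raw = b"".join(struct.pack("<f", g) for g in grads)
packed = zlib.compress(raw, level=9)
ratio = len(packed) / len(raw)
print(f"compressed to {ratio:.0%} of original size")
```

The ratio stays close to 1, so compression alone does not change the order of magnitude of the bandwidth problem; schemes like gradient quantization or sparsification are the usual lossy workarounds.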

Redict avatar Mar 03 '23 14:03 Redict

It is not super easy to do distributed training, but there are specialized groups, e.g. check https://github.com/learning-at-home/hivemind

andreaskoepf avatar Mar 03 '23 15:03 andreaskoepf

and https://www.together.xyz/

andreaskoepf avatar Mar 03 '23 15:03 andreaskoepf

@andreaskoepf there are also BOINC-driven solutions

Redict avatar Mar 03 '23 15:03 Redict

I would take inspiration from earlier LAION success and lean more towards distributed pre-processing of training data. There is plenty of cost in things like the reinforcement learning rollouts of RLHF, Chain of Thought, and ToolFormer.

umbra-scientia avatar Mar 04 '23 09:03 umbra-scientia

I think distributed pre-processing would work great for specific elements, such as the “execution of API calls”-step in ToolFormer.

You could:

  1. centrally sample the API calls using the LM and various prompts,
  2. then execute the API calls decentralized (this part is easily containerized and probably more influenced by CPU resources and network latency), and
  3. finally centrally filter the API calls/responses and fine-tune the model.
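Step 2 above can be sketched as a simple worker loop. The task format and field names here are hypothetical; a real worker could execute any HTTP API, script, or retrieval call:

```python
import json
from urllib import request

# Hypothetical worker-node sketch: receive a batch of sampled API
# calls and execute them, returning responses (or errors) per task.
def execute_api_calls(tasks):
    results = []
    for task in tasks:
        try:
            with request.urlopen(task["url"], timeout=10) as resp:
                results.append({"id": task["id"], "response": resp.read().decode()})
        except Exception as exc:
            # Report failures so the central server can reschedule or discard.
            results.append({"id": task["id"], "error": str(exc)})
    return results
```

The worker only needs CPU and network access, which fits the point that this stage is cheap to containerize and distribute.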

Excerpt from the ToolFormer paper:

Executing API Calls As a next step, we execute all API calls generated by $M$ to obtain the corresponding results. How this is done depends entirely on the API itself – for example, it can involve calling another neural network, executing a Python script or using a retrieval system to perform search over a large corpus. The response for each API call $c_i$ needs to be a single text sequence $r_i$.

You would not be transferring millions of parameters, gradients, etc. You would be sending URLs to Docker images and the API calls you want executed, and you would be receiving responses from these API calls. Validation of responses could be simple majority voting.
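The majority-voting validation could be as simple as this sketch (assuming each API call is dispatched to several independent nodes):

```python
from collections import Counter

# Accept a response only when a strict majority of the nodes that
# executed the same API call returned identical output.
def majority_vote(responses):
    """Return the winning response, or None if no strict majority."""
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) / 2 else None

print(majority_vote(["42", "42", "7"]))  # → 42
print(majority_vote(["a", "b"]))         # → None
```

This defends against the tampering concern raised in the original proposal, at the cost of executing each call redundantly.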

jsekamane avatar Mar 05 '23 11:03 jsekamane

> I would take inspiration from earlier LAION success and lean more towards distributed pre-processing of training data. There is plenty of cost in things like the reinforcement learning rollouts of RLHF, Chain of Thought, and ToolFormer.

Very interesting idea. If someone is interested in closer planning and prototyping this please let us know (either here or OA discord).

andreaskoepf avatar Mar 05 '23 13:03 andreaskoepf

I am not really familiar with all of these, but I can implement it if I get a more in-depth explanation.

Redict avatar Mar 05 '23 21:03 Redict