Roadmap (tentative)
Current tasks:
- [ ] prototype bloom points system @borzunov (#6)
- [x] local tensor parallelism (#143, using BlackSamorez/tensor_parallel by @blacksamorez and @IaroslavLisniak)
- [x] increase the default max sequence length (from #146)
- [x] allow running a server without open ports @Vahe1994
- [ ] option to download pre-quantized blocks (@mryab)
- [x] improved routing (@justheuristic)
- [x] upgrade to the latest libp2p (@Vahe1994)
- [ ] touch up fine-tuning examples, make sure they work in reasonable time (@justheuristic)
- [ ] a way to temporarily shut down a Petals server
  - suggested by @craffel: when running a Petals server on a machine that is often in use, people should be able to shut the server down while running their own experiments
  - suggested behavior: shut down ASAP, then restart once the GPUs have been idle for T minutes (see the watchdog sketch right after this task list)
- Wanna contribute?
  - go to our Discord server and ask around!
  - always in demand:
    - contribute examples (recommended but not required: create an issue / draft first, before you code them)
    - OS support / hardware support (e.g., see #147)
    - more models: OPT-175B, Switch-XXL, whatever comes into fashion
    - host a server! (see the README)
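
For the temporary-shutdown idea above, here is a rough sketch of what such a watchdog could look like. It is purely illustrative and not part of Petals: the server launch command, idle threshold, and polling interval are placeholders to adapt to your setup (only the `nvidia-smi` queries are real).

```python
# Hypothetical watchdog: stop the Petals server as soon as another process grabs
# the GPU, restart it after the GPUs have been idle for T minutes.
import subprocess
import time

SERVER_CMD = ["python", "-m", "petals.cli.run_server", "bigscience/bloom-petals"]  # placeholder; use your usual launch command
IDLE_MINUTES = 30        # the "T minutes" from the suggestion above
CHECK_EVERY_SEC = 60


def foreign_gpu_pids(own: set) -> set:
    """PIDs of GPU compute processes that do not belong to our server (via nvidia-smi)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"]
    ).decode()
    return {int(p) for p in out.split()} - own


server, idle_since = None, None
while True:
    # Note: a real script should also track the server's child processes here.
    own_pids = {server.pid} if server is not None else set()
    if foreign_gpu_pids(own_pids):       # someone else is using the GPU
        idle_since = None
        if server is not None:           # shut down ASAP
            server.terminate()
            server.wait()
            server = None
    else:                                # GPU is free (except possibly for us)
        idle_since = idle_since if idle_since is not None else time.time()
        if server is None and time.time() - idle_since >= IDLE_MINUTES * 60:
            server = subprocess.Popen(SERVER_CMD)  # restart after T idle minutes
    time.sleep(CHECK_EVERY_SEC)
```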
End of December: cover more use cases
- [x] tutorial: generation notebook (a minimal client-side sketch follows this list)
- [x] tutorial: prompt-tuning notebook
- [x] PreLayer prompt-tuning (mentioned as one of the baselines in https://arxiv.org/abs/2106.09685)
- [x] inference with prompt-tuning (https://github.com/learning-at-home/bloom-demo/issues/13 by @artek0chumak)
- [x] advanced inference: beam search, constraints/fusion, LoRA/AdaMix (@artek0chumak, https://github.com/learning-at-home/bloom-demo/issues/13)
- [x] some kind of hub for tutorials, e.g. a minimalistic website
- [x] alpha test: let more people play with the 176B model (where? no-brainer: BigScience, Stability, Discord)
- [x] rich inference interface for designing custom generation algorithms (by: @artek0chumak)
- [x] let servers run requests with different priorities (https://github.com/learning-at-home/bloom-demo/issues/8 by: @greenfatguy)
- By this point, we must answer the main questions: (1) will people use it? (2a) what for? (2b) why not?
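
To illustrate what the generation notebook above covers, here is a minimal client-side sketch following the README-style interface of the time. The class and checkpoint names (`DistributedBloomForCausalLM`, `bigscience/bloom-petals`) are assumptions that may differ between releases, so check the current README and tutorials for the exact names.

```python
# Hedged sketch of client-side generation with Petals (README-style interface).
# The class and checkpoint names below are assumptions and may differ by version.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"  # assumed public swarm checkpoint name

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
# Embeddings and the LM head run locally; transformer blocks run on remote servers.
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)  # standard HF generate() call
print(tokenizer.decode(outputs[0]))
```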
End of ~~July~~ August: make it reliable, test with early adopters
- [x] make it so that servers cannot be killed by a bad client request (https://github.com/learning-at-home/hivemind/issues/3 by: @justheuristic)
- [x] find the best way to reduce the size of the 176B model (https://github.com/learning-at-home/bloom-demo/issues/4 by: @TimDettmers)
- [x] let servers automatically find and serve the most in-demand layers (@borzunov)
- [x] implement popular non-beam inference types (https://github.com/learning-at-home/bloom-demo/issues/13 by @artek0chumak)
- [x] compress the activations sent between client and server nodes (by: @mryab)
- [x] find enough hardware to run the 176B model (https://github.com/learning-at-home/bloom-demo/issues/14 by: @justheuristic)
- [x] pre-alpha test: once it is stable enough, let some (trusted) folks play with it and get their feedback
- [x] submit an EMNLP system demonstration proposal (https://2022.emnlp.org/calls/System_Demonstrations/)
- begin investigating: tutorials, documentation, examples
End of June: build a proof-of-concept
- [x] agree on the user interface (see https://github.com/learning-at-home/bloom-demo/issues/5#issuecomment-1159776494)
- [x] run simple (but correct!) inference with a smaller model (for generation)
- [x] do simple (but correct!) forward/backward with frozen layers (for prompt tuning; a sketch follows this list)
- [x] client can dynamically choose which remote servers to use for inference (by: @justheuristic)
- [x] create basic correctness tests for later
- [x] check if 8-bit compression is remotely feasible (by: @TimDettmers)
- it's okay if the code is not super reliable for now
- it's okay if servers have to be set up manually for now
- begin investigating: quantized weights, quantized communication, automatic server allocation, "bloom points"
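
As a companion to the frozen-layers item above, here is a hedged sketch of prompt tuning through the distributed model. The `tuning_mode="ptune"` / `pre_seq_len` arguments and the class/checkpoint names follow the README of the time and are assumptions; the idea is that only the locally held prompt parameters receive updates, while the remote transformer blocks stay frozen.

```python
# Hedged prompt-tuning sketch: forward/backward through the (frozen) remote blocks,
# training only the soft prompts that live on the client. Argument and class names
# follow the README of the time and are assumptions, not a stable API.
import torch
import torch.nn.functional as F
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"  # assumed public checkpoint name
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, tuning_mode="ptune", pre_seq_len=16  # trainable prompts are held locally
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # only local params are trainable

batch = tokenizer(["A cat sat on a mat."], return_tensors="pt")["input_ids"]
for step in range(10):
    logits = model(input_ids=batch).logits
    # Standard causal-LM objective; assumes logits align with the input tokens
    # (prompt positions stripped) -- check the prompt-tuning notebook to confirm.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]), batch[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()   # gradients flow back through the remote, frozen transformer blocks
    optimizer.step()  # updates only the local prompt embeddings
    print(f"step {step}: loss = {loss.item():.3f}")
```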
Important, but not urgent:
- multiplicative adapters from https://wandb.ai/learning-at-home/LM_OWT/reports/Parameter-sharing-revisited--VmlldzoxOTAxNjcx ?
- non-critical performance improvements (#11)
- better fine-tuning methods: LoRA, AdaMix, PreLayer (see the LoRA/AdaMix item above), or whatever is SoTA at the time of building
- fully decentralized point system
[moved inference of prompt-tuned model and priorities from summer to current tasks]
Hey, how hard would it be to extend Petals to support training these models, in addition to fine-tuning?
Hi @bionicles, Petals is a system designed specifically for inference of large models; however, it shares a lot of the underlying architecture with SWARM Parallelism (see https://github.com/yandex-research/swarm for a WIP implementation, which I hope to update in the coming weeks).
The short answer is "definitely possible", but please keep in mind that pretraining is out of scope for Petals. Hence, it might be more useful to continue the discussion elsewhere (e.g., in the SWARM repo or on our Discord server) if you have specific questions or suggestions.
Hi @bionicles,
A small addition to @mryab's response: while Petals does not support training from scratch, both Petals and SWARM are based on hivemind, our library for training over the Internet, which can be used for pre-training. Please see Q3 of the FAQ's "General" section for details.
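
For anyone wondering what "use hivemind for pre-training" means in practice, here is a rough sketch based on the hivemind quickstart; the toy model, `run_id`, and batch sizes are placeholders, and the FAQ/quickstart remains the authoritative reference.

```python
# Rough sketch following the hivemind quickstart: collaborative training over the
# Internet. The toy model, run_id, and batch sizes are placeholders.
import torch
import torch.nn as nn
import hivemind

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))  # toy model
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

dht = hivemind.DHT(start=True)  # joins or creates the peer-to-peer network
print("To join this run from another machine, use initial_peers =", dht.get_visible_maddrs())

opt = hivemind.Optimizer(
    dht=dht,
    run_id="toy_run",            # peers with the same run_id train together
    batch_size_per_step=32,      # samples processed locally per optimizer step
    target_batch_size=4096,      # global batch size after which peers synchronize
    optimizer=base_opt,
    use_local_updates=True,      # apply local steps between averaging rounds
    verbose=True,
)

# The training loop is identical to plain PyTorch, except the optimizer is hivemind's.
for step in range(100):
    inputs = torch.randn(32, 784)           # stand-in for a real DataLoader
    labels = torch.randint(0, 10, (32,))
    loss = F.cross_entropy(model(inputs), labels) if False else nn.functional.cross_entropy(model(inputs), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```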