Alexander Borzunov
Hi @scenaristeur, How much GPU memory does this model take in total? At first sight, it seems that this model requires less than 8-10 GB and fits on many consumer GPUs, so...
Hi @mryab, can you take a look at this?
Hi @Mathnerd314, Your suggestions sound reasonable. We'll start with an option to slice an inference session (`reuse_inference(old[start:end])`) - I hope to add it in one of the upcoming releases.
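Here's a minimal sketch of how this could look (`reuse_inference` and the slicing syntax come from the proposal above and are *not* a released API; this assumes the Petals 2.x pattern where `generate()` accepts an open session):

```python
# Today: an inference session keeps attention caches across generate() calls
with model.inference_session(max_length=512) as session:
    model.generate(inputs, session=session, max_new_tokens=16)

    # Proposed (hypothetical): start a new session that reuses the first
    # 64 positions of the old session's caches instead of recomputing them
    new_session = model.reuse_inference(session[0:64])
```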
Hi @LouSparfell, As far as we know, homomorphic encryption and ZK methods are too slow to apply to LLMs, since they are designed for integer computations and are not...
Hi @fadenb, What you're saying is 100% reasonable; we just didn't have time to do that, since it would require additional complexity on the server side. If you can help with...
Hi @iateadonut, Yes, a server should host a set of sequential blocks. Re mock CPU servers, you can create a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) with a really small model like `bigscience/bloom-560m` and...
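A quick sketch of pointing a client at such a private swarm (assuming the Petals 2.x Python API; the multiaddress is a placeholder that the first server/DHT node of your swarm prints on startup - see the wiki linked above):

```python
from petals import AutoDistributedModelForCausalLM

# Placeholder: replace with the multiaddr(s) of your private swarm's bootstrap peer
INITIAL_PEERS = ["/ip4/127.0.0.1/tcp/31337/p2p/QmYourBootstrapPeerID"]

model = AutoDistributedModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", initial_peers=INITIAL_PEERS
)
```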
Hi @iateadonut, `dht_utils.get_remote_module_infos()` returns information about all servers (both remote ones and your own). Note that:

- You need to be connected to the **public swarm** to see servers hosted by...
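For example (a sketch against Petals' internal API, so signatures may differ between versions; the `dht_prefix` below is illustrative - check the model's config for the real one):

```python
import hivemind
from petals.constants import PUBLIC_INITIAL_PEERS
from petals.dht_utils import get_remote_module_infos

# A lightweight client-mode DHT node connected to the public swarm
dht = hivemind.DHT(initial_peers=PUBLIC_INITIAL_PEERS, client_mode=True, start=True)

# Module UIDs look like "<dht_prefix>.<block_index>"
uids = [f"bigscience/bloom-560m-petals.{i}" for i in range(24)]

infos = get_remote_module_infos(dht, uids, latest=True)
for info in infos:
    if info is not None and info.servers:
        print(info.uid, "->", [str(peer_id) for peer_id in info.servers])
```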
@iateadonut No, but you can filter out your local peer_id to keep only remote infos, like we do in `should_choose_other_blocks()`.
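A sketch of that filtering, reusing `infos` from the snippet above (`my_peer_id` would be your own server's peer ID, e.g. its DHT node's `dht.peer_id`):

```python
my_peer_id = dht.peer_id  # replace with your server's own PeerID

remote_only = []
for info in infos:
    if info is None:
        continue
    # Drop our own entry, keeping only blocks served by other peers
    others = {pid: srv for pid, srv in info.servers.items() if pid != my_peer_id}
    if others:
        remote_only.append((info.uid, others))
```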
@fadenb @iateadonut For the record, another reason why downloading blocks is slow is that StableBeluga2 weights are distributed in float32 and Llama weights are distributed in float16, while we host...
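Back-of-the-envelope numbers (StableBeluga2 is a ~70B-parameter Llama-2 derivative):

```python
params = 70e9  # ~70B parameters

print(f"float32 checkpoint: ~{params * 4 / 1e9:.0f} GB")  # ~280 GB to download
print(f"float16 checkpoint: ~{params * 2 / 1e9:.0f} GB")  # ~140 GB, half the traffic
```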
@iateadonut Yes, you can extract it into a separate function if it's useful.