FFAMax

Results: 25 issues opened by FFAMax

Scenario: the model requires 3 nodes to meet the minimum VRAM. Once one of the nodes experiences a timeout, the task falls back to the host and crashes due to insufficient memory. In...

For some reason the browser may disconnect from the node; the task keeps running, but the web UI is unable to recover the last conversation.

1. Added a few GPUs. 2. Tuned the timeout. On slow setups (~1 token per second) an average response may run ~600-1000 tokens. In most cases this leads to a timeout (network error...
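The arithmetic behind that timeout complaint can be sketched directly (the token counts and rate are the ones quoted in the report, not measured values): at ~1 token/s, a full response needs 10-17 minutes, so any request timeout shorter than that fires before generation finishes.

```python
# Back-of-envelope check of the reported numbers: how long a full response
# takes at a given generation rate, and whether a timeout would trip first.
def response_time_seconds(tokens: int, tokens_per_second: float = 1.0) -> float:
    """Time needed to stream a complete response at the given rate."""
    return tokens / tokens_per_second

for tokens in (600, 1000):
    seconds = response_time_seconds(tokens)
    print(f"{tokens} tokens at 1 tok/s -> {seconds / 60:.1f} minutes")
```

So a default timeout on the order of a minute or two is guaranteed to fail on such hardware; it would have to exceed roughly 1000 s to be safe here.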

Found instability:
```
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=15, n_layers=32): SQLite objects created in a thread can only be used in that same thread. The object was created...
```
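That error is Python's `sqlite3` thread-affinity check: a connection created in one thread was reused from another. A common workaround (a sketch of the general pattern, not exo's actual fix) is to give each thread its own connection via `threading.local()`:

```python
import sqlite3
import threading

# Each thread gets its own sqlite3 connection, created lazily on first use.
# This avoids "SQLite objects created in a thread can only be used in that
# same thread" without resorting to check_same_thread=False plus manual locks.
_local = threading.local()

def get_connection(path: str = ":memory:") -> sqlite3.Connection:
    """Return a per-thread sqlite3 connection, creating it on first use."""
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(path)
    return _local.conn

def worker() -> None:
    # Safe: this connection belongs to the calling thread.
    conn = get_connection()
    conn.execute("CREATE TABLE IF NOT EXISTS shards (id TEXT)")
    conn.execute("INSERT INTO shards VALUES ('llama-3.2-1b')")
    conn.commit()

t = threading.Thread(target=worker)
t.start()
t.join()
```

Note that with `:memory:` each thread sees a separate database; a shared on-disk path would also need write serialization, which SQLite handles via its own locking.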

Example: `CUDA_VISIBLE_DEVICES=0,1,2 python3 examples/llama3.py --download_model --shard 3`. So exo needs to switch from a model where each node has only one device to a model where a node can have multiple devices...
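The suggested change can be sketched as a data-model tweak: a node owns a list of devices and advertises their combined memory. All names here (`Device`, `Node`, `total_vram_gb`) are illustrative assumptions, not exo's actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Device:
    index: int       # e.g. a CUDA device index from CUDA_VISIBLE_DEVICES
    vram_gb: float   # memory this single device contributes

@dataclass
class Node:
    node_id: str
    devices: List[Device] = field(default_factory=list)

    def total_vram_gb(self) -> float:
        """Aggregate memory a scheduler could count for this node."""
        return sum(d.vram_gb for d in self.devices)

# One host exposing three GPUs as a single node, instead of three nodes:
node = Node("node1", [Device(0, 24.0), Device(1, 24.0), Device(2, 24.0)])
print(node.total_vram_gb())  # 72.0
```

With this shape, shard placement can consider a host's aggregate VRAM rather than forcing one exo process per GPU.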

The last meaningful output:
```
ram used: 5.67 GB, tok_embeddings.weight : 99%|▉| 290/292 [00:25
```

Hello, Team. I have 2 nodes on the same host, but only one is taking load. Any ideas? Running with: `CUDA_VISIBLE_DEVICES=0 exo --node-id=node1 --node-port=65001 --discovery-module manual --discovery-config-path n2.cfg --inference-engine=tinygrad` and `DEBUG=6 CUDA_VISIBLE_DEVICES=1...`

It is trying to load and never completes:
```
Removing download task for Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16): True 0%| | 0/148 [00:00
```

Like:
```
exo --run-model /home/user/.cache/tinygrad/downloads/llama3-8b-sfr/tokenizer.model
```

Example:
```
SUPPORT_BF16=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python3 examples/llama3.py --download_model --shard 7 --size 8B
seed = 1730782018
0%| | 0/292 [00:00
```