Sidharth Baskaran
I'm getting an error `RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 0` when initializing an engine for offline batched...
@Yard1 Thanks for fixing! Running into a CUDA OOM error this time with the same code: I'm using a single A100-40GB. I did specify `bfloat16` when initializing the engine, so...
@Yard1 Thanks, I was able to get inference working by reducing `max_num_seqs` from its default of 256 to a much smaller number like 32. With `enable_lora=False`, I can use `max_num_seqs=256`.
I'm also running into errors installing from source with the latest commit. This happens with both `python setup.py install` and `pip install -e .`. Seems like it's originating from compiling...
Just submitted the PR to implement R-ROME: #237
I've been thinking about how to speed this up further, and I ended up with a solution very similar to yours. Is there any way of parallelizing across the number...
@sabetAI came across this function after asking for help in the PyG community: https://pyg-lib.readthedocs.io/en/latest/modules/ops.html#pyg_lib.ops.segment_matmul. It effectively vectorizes across the number of adapters. Some quick testing shows it's actually much slower...
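
For anyone following along, here is a rough pure-Python sketch of what a segmented matmul computes, as I understand it from the linked docs: the input rows are grouped into contiguous segments by a pointer list, and each segment is multiplied by its own weight matrix (one per adapter). The names `segment_matmul_reference`, `inputs`, `ptr`, and `weights` are my own for illustration, not the pyg-lib API, and this shows only the semantics, not the performance.

```python
def matmul(rows, weight):
    """Multiply a list of row vectors (n x k) by a weight matrix (k x m)."""
    # zip(*weight) iterates over the columns of the weight matrix.
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in rows
    ]


def segment_matmul_reference(inputs, ptr, weights):
    """Hypothetical reference: rows ptr[i]:ptr[i+1] are multiplied by weights[i].

    inputs  -- list of N row vectors, each of length k
    ptr     -- list of B + 1 segment boundaries, starting at 0 and ending at N
    weights -- list of B weight matrices, each k x m (one per adapter)
    """
    if len(ptr) != len(weights) + 1:
        raise ValueError("ptr must have exactly one more entry than weights")
    out = []
    for i, weight in enumerate(weights):
        segment = inputs[ptr[i]:ptr[i + 1]]
        out.extend(matmul(segment, weight))
    return out


# Two adapters: rows 0-1 use the identity, row 2 uses a 2x scaling matrix.
inputs = [[1, 2], [3, 4], [5, 6]]
ptr = [0, 2, 3]
weights = [
    [[1, 0], [0, 1]],
    [[2, 0], [0, 2]],
]
result = segment_matmul_reference(inputs, ptr, weights)
print(result)
```

The loop over adapters here is exactly the part a fused kernel would replace with a single call, so this is only useful as a correctness check against a vectorized version.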