sgsdxzy
Here are my results on a 3080 Ti, image size 512x512, Euler a, 30 steps, batch size = 8:

| settings | it/s |
| --- | --- |
| default | 2.10 |
...
I ran into similar problems. I think this is probably caused by the tokenizer config adding extra tokens that are not handled correctly.
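As a minimal sketch of the kind of mismatch I mean, assuming the issue is that the tokenizer vocabulary grew past the model's embedding table (the model path is just a placeholder):

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# extra tokens added via tokenizer_config.json / added_tokens.json show up here
vocab_size = len(tokenizer)
embed_size = model.get_input_embeddings().weight.shape[0]
print(vocab_size, embed_size)

# if the tokenizer grew but the embedding matrix did not, token ids past the
# original vocab size will index out of range unless the embeddings are resized
if vocab_size != embed_size:
    model.resize_token_embeddings(vocab_size)
```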
@oobabooga Update: It seems I have to load the whole model in every process and let it be chunked (previously I split-loaded the model across multiple GPUs so each process...
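Roughly what I mean by "load the whole model in every process and let it be chunked", as a sketch; this assumes the script is started with the deepspeed launcher so each rank is its own process, and the model path is a placeholder:

```
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# every rank first loads the full fp16 checkpoint into its own CPU memory...
model = AutoModelForCausalLM.from_pretrained("path/to/model", torch_dtype=torch.float16)

# ...then init_inference shards ("chunks") the weights across the GPUs in the group
model = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # tensor-parallel degree, set by the launcher
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)
```

Launched with something like `deepspeed --num_gpus 2 script.py`.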
@oobabooga Here is some mixed but still very interesting news: first, I managed to get GPT-Neo and OPT to work. In fact, the kernel support list includes most of the model types textui...
@huangjiaheng You can write in Chinese; I can read it. It seems your translation software is cutting off sentences. If you struggle with English, you can use Chinese.
Update: I can get split loading to work following the example at https://github.com/huggingface/transformers-bloom-inference/blob/e970be1027afc43c147d06153635f4285c517081/bloom-inference-scripts/bloom-ds-inference.py, but int8 and LLaMA are still not working yet.
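For reference, the trick in that example, as I understand it, is to build the model on the meta device so no rank ever materializes the full weights, and let DeepSpeed stream in the shards from a checkpoint description; a rough sketch (the model name, mp_size, and checkpoint json path are placeholders):

```
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")  # placeholder model

# weights stay on the meta device, so this allocates no real memory
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
model = model.eval()

# init_inference then loads the real shards, already split across the ranks
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    checkpoint="checkpoints.json",  # placeholder: json listing the per-shard weight files
    replace_with_kernel_inject=True,
)
```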
With help from https://github.com/microsoft/DeepSpeed/issues/3099, I managed to get tensor-parallel inference working for LLaMA! However, I noticed that without a custom optimized kernel the performance does not scale: 2080Ti...
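For anyone else trying this, roughly what worked for me, as far as I can reconstruct it from that issue; the injection_policy module names are what I recall for LLaMA, so double check them against your transformers version, and the model path and mp_size are placeholders:

```
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = AutoModelForCausalLM.from_pretrained("path/to/llama", torch_dtype=torch.float16)

# there is no fused kernel for llama here, so we only tell DeepSpeed which
# linear layers end a tensor-parallel block (the all-reduce points); that is
# also why throughput barely improves over a single GPU
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=False,
    injection_policy={LlamaDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")},
)
```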
I think you could make groupsize a parameter that defaults to 128 rather than a hard-coded value. It could also accept -1 to load old 4-bit models.
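Something like this is what I have in mind; the flag name and the call site are only illustrative, not the actual webui code:

```
import argparse

parser = argparse.ArgumentParser()
# expose groupsize instead of hard-coding 128; -1 means "no grouping",
# which is what the old 4-bit checkpoints were quantized with
parser.add_argument("--groupsize", type=int, default=128,
                    help="GPTQ group size; use -1 for old 4-bit models")
args = parser.parse_args()

# illustrative call site: GPTQ-for-LLaMa's load_quant already takes a groupsize argument
# model = load_quant(model_path, checkpoint_path, 4, args.groupsize)
```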
@oobabooga Now that @ortegaalfredo has pinned down the problem, this is easy to fix by replicating the original device map:
```
params['device_map'] = {"base_model.model." + k: v for k, v in shared.model.hf_device_map.items()}
...
```
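For context, a sketch of where that remapped map would then be used when attaching the LoRA; the `PeftModel.from_pretrained` call and `lora_path` below are illustrative of the load path, not copied from the repo:

```
from peft import PeftModel

# peft wraps the base model, so its module names gain the "base_model.model."
# prefix; reusing the base model's hf_device_map with that prefix keeps every
# LoRA-wrapped module on the same GPU as the weights it wraps
shared.model = PeftModel.from_pretrained(shared.model, lora_path, **params)
```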
I am wondering whether model.half() is still necessary, as it can take several minutes for large models.
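If the goal is just to end up with fp16 weights, one way to sidestep the conversion entirely, as a sketch (the model path is a placeholder), is to load in half precision directly:

```
import torch
from transformers import AutoModelForCausalLM

# loading the checkpoint straight into fp16 avoids building a full fp32 copy
# and then converting it, which is where the minutes go for large models
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,  # also skips the initial random-weight allocation
)
```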