sgsdxzy

Results: 11 issues by sgsdxzy

Without `temperature`, the generation is always fixed. Without `repetition_penalty`, it seems to fall into a loop of repeating the last words. Using model glm-10b-chinese.
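The two knobs above can be sketched in plain Python. This is a minimal, hedged illustration of what sampling-time `temperature` scaling and a CTRL-style `repetition_penalty` (the scheme Transformers' `generate` uses) do to the logits; the function names are my own, not the library's:

```python
import math

def apply_temperature(logits, temperature):
    # Dividing logits by the temperature: T < 1 sharpens toward greedy
    # (deterministic) decoding, T > 1 flattens toward uniform sampling.
    return [l / temperature for l in logits]

def apply_repetition_penalty(logits, generated_ids, penalty):
    # CTRL-style penalty: shrink the score of tokens already generated,
    # which breaks the "repeat the last words forever" loop.
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

def softmax(logits):
    # Convert (penalized, temperature-scaled) logits into probabilities.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With `penalty > 1`, a previously generated token's positive logit is divided down (and a negative one pushed further down), so its probability after `softmax` drops.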

### Describe the bug Spec: 2080Ti 22G x3, trying to run llama-30B + alpaca-30b from https://huggingface.co/baseten/alpaca-30b. Cannot get the LoRA to load without VRAM OOM. I can run llama-30B in...

bug
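One common cause of this OOM is that `device_map="auto"` fills every card with the base model, leaving no headroom for the LoRA deltas. A hedged sketch of one workaround, assuming `transformers` and `peft` are installed; the helper `make_max_memory` and the 18 GiB budget are my own illustrative choices, not values from the issue:

```python
def make_max_memory(n_gpus: int, per_gpu_gib: int, cpu_gib: int = 64) -> dict:
    # Build the max_memory mapping that transformers' device_map
    # machinery accepts; capping each 22G card a few GiB below its
    # physical size leaves room for the adapter weights and activations.
    mm = {i: f"{per_gpu_gib}GiB" for i in range(n_gpus)}
    mm["cpu"] = f"{cpu_gib}GiB"
    return mm

def load_llama_with_lora(base_path: str, lora_path: str):
    # Sketch only: assumes the checkpoints exist locally; not executed here.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel
    base = AutoModelForCausalLM.from_pretrained(
        base_path,
        device_map="auto",
        max_memory=make_max_memory(3, 18),  # 3x 2080Ti 22G, keep headroom
        torch_dtype="auto",
    )
    return PeftModel.from_pretrained(base, lora_path)
```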

**Description** The current multi-GPU setup uses the simple pipeline parallelism (PP) provided by Hugging Face Transformers, which is inefficient because only one GPU can work at a time. DeepSpeed-Inference introduces...

enhancement
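The contrast between the two schemes: under naive PP each GPU holds a contiguous slice of layers and waits its turn, while under tensor parallelism (TP) every GPU holds a slice of each weight matrix and computes concurrently. A hedged sketch, where `tp_shard_shape` is my own illustrative helper and the DeepSpeed call is written from the older `mp_size` keyword (newer releases use a `tensor_parallel` config instead):

```python
def tp_shard_shape(rows: int, cols: int, tp_size: int, split_dim: int = 1):
    # Under TP each rank holds one slice of a weight matrix, e.g. a
    # column-split of a (4096, 11008) MLP projection across 4 GPUs.
    if split_dim == 1:
        assert cols % tp_size == 0
        return (rows, cols // tp_size)
    assert rows % tp_size == 0
    return (rows // tp_size, cols)

def wrap_with_deepspeed_tp(model, world_size: int):
    # Sketch only: assumes deepspeed and torch are installed; not executed here.
    import torch
    import deepspeed
    return deepspeed.init_inference(
        model,
        mp_size=world_size,               # tensor-parallel degree
        dtype=torch.float16,
        replace_with_kernel_inject=True,  # fused inference kernels
    )
```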

The new `fused_mlp` does not seem to work on some cards: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/179. Passing `--no-fused_mlp` should work around it.

**Describe the bug** Inference fails with `RuntimeError: mat1 and mat2 shapes cannot be multiplied (15x4096 and 2048x11008)` when trying to...

bug
inference
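The error above is the standard matmul contract: for `mat1 @ mat2`, mat1's last dimension must equal mat2's first. Here the weight's first dimension (2048) is exactly half the expected 4096, which plausibly points at a checkpoint packed or sharded differently than the loader expects, though the truncated report doesn't confirm that. A minimal sketch of the check (`matmul_ok` is my own illustrative name):

```python
def matmul_ok(a_shape, b_shape) -> bool:
    # torch raises exactly this RuntimeError when the inner dimensions
    # disagree: mat1 is (n, k), so mat2 must be (k, m).
    return a_shape[-1] == b_shape[0]
```

So `(15, 4096) @ (2048, 11008)` fails, while `(15, 4096) @ (4096, 11008)` would be the expected full-size MLP projection.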

### Is your feature request related to a problem? Please describe. The model is currently loaded via remote code. Are there plans to merge the model implementation directly into transformers main? That way: 1. remote code is no longer needed; 2. there is a convenient, unified interface; 3. it signals the code is stable, making it easier to build new features on top, such as peft and GPTQ. ### Solutions Merge the GLM implementation into transformers main via a pull request. ### Additional context _No...

enhancement
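The difference the request asks for is concrete at the call site: loading a model whose implementation only ships with the checkpoint requires the `trust_remote_code=True` escape hatch, which an upstreamed implementation would drop. A hedged sketch; `remote_code_kwargs` and `load_glm` are my own illustrative names:

```python
def remote_code_kwargs(merged_upstream: bool) -> dict:
    # Once the implementation lands in transformers main, the
    # trust_remote_code escape hatch is no longer needed.
    return {} if merged_upstream else {"trust_remote_code": True}

def load_glm(path: str, merged_upstream: bool = False):
    # Sketch only: assumes transformers is installed and the checkpoint
    # exists locally; not executed here.
    from transformers import AutoModel, AutoTokenizer
    kwargs = remote_code_kwargs(merged_upstream)
    tok = AutoTokenizer.from_pretrained(path, **kwargs)
    model = AutoModel.from_pretrained(path, **kwargs)
    return tok, model
```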

I have a TrackIR 5 device and I can confirm it works under Windows 10. I want to set up tracking under my Arch Linux. I tried both building linuxtrack 0.99.19 from source...

After c90adefbf1934f4638ea5c3bba8fc536aad3de57, when `fused_mlp` is enabled, I got the following error: ``` python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"'...

The inference speed of naive model parallel is much better than tensor parallel. Setup: Llama-30b on 2080Ti 22G x4. Naive: 31.64s; 4-way TP, main branch: 177.78s; 4-way TP, llama branch:...
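Numbers like the 31.64s vs 177.78s above are wall-clock measurements of a whole generation call; a minimal, hedged sketch of how such a comparison might be timed (the helper is my own, and real GPU runs would also need `torch.cuda.synchronize()` around each call):

```python
import time

def time_generate(fn, n_warmup: int = 1, n_runs: int = 3) -> float:
    # Wall-clock an arbitrary generation callable; warm-up runs keep
    # one-time costs (CUDA init, allocator growth) out of the average.
    for _ in range(n_warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - t0) / n_runs
```

Timing the same prompt and generation length under each parallelism scheme is what makes the naive-MP vs TP comparison apples-to-apples.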