> Use HMM! But HMM produces new (unseen) words, so that won't work.
@mrwyattii First, thanks for your reply :) What is very strange is that I used your example above with tp=1 and tp=2 for testing. tp=2 takes a lot of time....
@Fazziekey @FrankLeeeee Same OOM issue. Same setup: A100 40GB, 1 GPU running the LLaMA-7B model, batch=1, max_seq_len=512, colossalai_zero2 with placement_policy='cuda'. I used torch.cuda.memory_allocated() to analyze memory usage; in SFTTrainer, self.optimizer = strategy.setup_optimizer...
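For anyone trying to reproduce this kind of memory analysis: a minimal sketch of how one might log allocated CUDA memory around the optimizer setup, using `torch.cuda.memory_allocated()` as the comment above mentions. The helper name `log_gpu_memory` and the call sites are illustrative, not part of ColossalAI.

```python
import torch

def log_gpu_memory(tag: str) -> int:
    """Print and return currently allocated CUDA memory in bytes (0 if no GPU)."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available")
        return 0
    allocated = torch.cuda.memory_allocated()
    print(f"[{tag}] allocated = {allocated / 1024**2:.1f} MiB")
    return allocated

# Illustrative call sites, e.g. around optimizer setup in a trainer:
before = log_gpu_memory("before setup_optimizer")
# self.optimizer = strategy.setup_optimizer(...)  # the line under investigation
after = log_gpu_memory("after setup_optimizer")
```

Comparing the two readings makes it easy to see how much extra memory the optimizer states consume, which is usually where ZeRO-2 OOMs come from on a single 40GB card.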
> Hi all,
>
> I was trying to load the model onto two 3090 Ti GPUs (24 GB VRAM each). Loading in 4-bit, however, I got the error message "ValueError: You can't...
> Thanks. I will take a look at it. It looks like there are some issues when multi_block_mode and sliding_window_attention are used together.

@PerkzZheng Is there any progress on this issue?
@Superjomn @byshiue Same question here. Could you give an example of a successful use of stop_words or stop_words_list? Thank you. I am currently using the service started by tensorrtllm_backend, the commit number...
Same issue here. Could you help look into it? Thanks @kaiyux
@byshiue Any progress? At present, I still have this problem with the latest version of the code (tensorrtllm: a96ccca).
> > @byshiue Any progress? At present, I still have this problem with the latest version of the code (tensorrtllm: [a96ccca](https://github.com/NVIDIA/TensorRT-LLM/commit/a96cccafcf6365c128f004f779160951f8c0801c)).
>
> it's fixed for my case, after...
> We have added it here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_modelopt.py#L474-L478. Do you still encounter the same issue on the latest main branch? If so, could you try printing some debug messages there to make sure...