AlpinDale comments

Results: 170 comments of AlpinDale

[V1] Support bad_words in sampler

As previously discussed offline, I believe bad_words is an incomplete solution to a very real problem; banning only the last token in the provided sequence is not useful at all,...
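For context, here is a minimal sketch of the prefix-match scheme commonly used for bad-word lists, where the final token of a banned sequence is masked only once the output already ends with the tokens before it. The function names and the list-based logits are hypothetical, chosen for illustration; this is not the sampler code under discussion.

```python
from typing import List, Set


def banned_last_tokens(generated: List[int], bad_words: List[List[int]]) -> Set[int]:
    """Return the token ids to mask at the next decoding step (illustrative only)."""
    banned: Set[int] = set()
    for seq in bad_words:
        prefix, last = seq[:-1], seq[-1]
        # Single-token entries are always banned. Multi-token entries ban
        # their last token only when the output currently ends with the prefix.
        if not prefix or (
            len(generated) >= len(prefix) and generated[-len(prefix):] == prefix
        ):
            banned.add(last)
    return banned


def apply_bans(logits: List[float], banned: Set[int]) -> List[float]:
    """Set banned token logits to -inf so they can never be sampled."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]
```

Note that in this scheme every token of the sequence except the last can still be emitted, which is the limitation the comment points at.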

[Usage]: Request for Trace ID Logging in Inference Engine

That sounds useful; I will look into it soon. I can try to implement it on our side if possible; otherwise, middleware should work fine for now.
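The middleware route could look roughly like the sketch below: a plain ASGI wrapper that reads a trace ID from the request (or generates one), exposes it to log records, and echoes it in the response. The `x-trace-id` header name, the class names, and the context variable are assumptions for illustration, not part of the engine.

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical context variable holding the trace ID of the current request.
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record as `record.trace_id`."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


class TraceIdMiddleware:
    """Pure ASGI middleware; the header name is an assumed convention."""

    def __init__(self, app, header_name: bytes = b"x-trace-id"):
        self.app = app
        self.header_name = header_name

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        # Reuse the caller's trace ID if present, otherwise generate one.
        headers = dict(scope.get("headers", []))
        incoming = headers.get(self.header_name, b"").decode("latin-1")
        trace_id = incoming or uuid.uuid4().hex
        token = trace_id_var.set(trace_id)

        async def send_with_trace(message):
            # Echo the trace ID back on the response headers.
            if message["type"] == "http.response.start":
                message = dict(message)
                message["headers"] = list(message.get("headers", [])) + [
                    (self.header_name, trace_id.encode("latin-1"))
                ]
            await send(message)

        try:
            await self.app(scope, receive, send_with_trace)
        finally:
            trace_id_var.reset(token)
```

With `TraceIdFilter` added to a handler, a log format such as `"%(trace_id)s %(message)s"` would then include the ID for each request.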

[Feature]: Automatic max-model-len or max-num-seqs

I'm not sure I understand the issue. If you need the engine to limit the max_model_len to the amount your GPU can fit, then we already handle that, as you...

[Installation]: Install v0.6.6/v0.6.7 on amd gpu gfx906 failed, v0.6.5 success but cannot run gptq

0.9.0 works, but 0.9.1 doesn't due to the new vectorized activation kernels being incompatible with ROCm. I will address this soon.

[Feature]: tensor parallelism support for bnb quantization (via IBM's fork)

Perhaps; I'll have to look into it. bnb hasn't been a priority.

[Feature]: tensor parallelism support for bnb quantization (via IBM's fork)

FYI, I'm working on new kernels that massively speed up bnb quants and add TP support for them. You might want to hold on for now, or help out with...

Reduce peak memory for prompt_logprobs requests

Will probably need some restructuring after #925

[Misc]: should we be using flashinfer for CUDA 12.1 or 12.4?

I believe code compiled against CUDA 12 works across all CUDA 12 minor versions. But we can switch to flashinfer's 12.4 wheels, if they have any.

[ Kernel ] AWQ Fused MoE

Thanks for doing this.

[Bug]: Another Bug while starting (Global)

Please follow the instructions in the issue template and run the env.py script so I can see what environment you're working with. I have no idea what the default kaggle...