LS comments

Question: what does "maximally activate some neuron" mean exactly?

@yosinski could you help with it when you have time? Many thanks.

Support loras on quantized models

Could someone explain the idea behind handling different dtypes for the base and LoRA weights?

Would you consider introducing a flash attention transformer?

> Introduce flashattention2

FlashAttention-2 is already integrated into PyTorch 2.2: https://pytorch.org/blog/pytorch2-2/

Multi-lora support

> This is definitely on our roadmap and will be tackled in the coming weeks. Here are the priorities right now:
>
> 1. re-write the scheduling code and cache...

Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`")

+1, the same error occurred for `Qwen1.5-14B-Chat-AWQ (1*4090)` and `Qwen1.5-70B-Chat-AWQ (4*4090)`, even after decreasing `--max-batch-prefill-tokens` to 512.

Respect cooldownPeriod for the first deployment and let the service be up and running based on the replica number for the first time.

@JorTurFer Hello, is there any plan for this fix? Thanks.

Respect cooldownPeriod for the first deployment and let the service be up and running based on the replica number for the first time.

Is there any progress here? I really need this feature :) Thanks.

Slashes in parameter

> In this case:
>
> ```go
> r.GET("/test/:name/*trail", func(c *gin.Context) {})
> ```
>
> The `trail` parameter seems to always contain a slash as the first character,...
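
For the leading slash described in the quoted comment, one possible workaround is to strip it after reading the parameter. This is a minimal sketch using only the standard library, assuming the value read for `trail` starts with a single `/`. The helper name `trimTrail` is hypothetical and not part of gin, and the sample path is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// trimTrail is a hypothetical helper (not part of gin). It removes one
// leading slash from a catch-all parameter value, if there is one.
func trimTrail(trail string) string {
	return strings.TrimPrefix(trail, "/")
}

func main() {
	// Assumed value of the `trail` parameter for a request to /test/foo/bar/baz.
	fmt.Println(trimTrail("/bar/baz"))
}
```

Inside a handler this would presumably be called as `trimTrail(c.Param("trail"))`. `strings.TrimPrefix` leaves the value unchanged when there is no leading slash, so the helper is safe either way.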

"Please hold" landing pages for slow scale from zero scenarios

I have a similar requirement; just customizing the error message would be enough for me.

"Please hold" landing pages for slow scale from zero scenarios

> by content and return code

+1