ymcki
I find that if I log in with my Gmail account, then it works; otherwise it does not. So maybe pytube should give a better error message?
You can log in by adding two more parameters: yt = YouTube("https://www.youtube.com/watch?v=pNcQ5XXMgH4", use_oauth=True, allow_oauth_cache=True)
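In case it helps anyone else, here is a minimal sketch of the full flow, assuming pytube's standard streams API (on the first run the OAuth device flow should prompt you to authorize in a browser):
```
from pytube import YouTube

# use_oauth=True triggers the OAuth flow on first use;
# allow_oauth_cache=True caches the token so later runs don't prompt again
yt = YouTube(
    "https://www.youtube.com/watch?v=pNcQ5XXMgH4",
    use_oauth=True,
    allow_oauth_cache=True,
)

# grab and download the highest-resolution progressive stream
stream = yt.streams.get_highest_resolution()
stream.download()
```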
> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can...
I tried gemma-3-12b with 32k context. Originally, it used a 12288MB fp16 KV cache. Now it uses 2048MB + 970MB = 3018MB. This is a huge improvement! My understanding is that 2048MB is the KV...
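For reference, a rough back-of-the-envelope check of those numbers. I am assuming gemma-3-12b's config here (48 layers, 8 KV heads, head_dim 256, with 1 global-attention layer per 6, i.e. 8 global + 40 local layers) and fp16 at 2 bytes per element; these values are not stated above, so treat this as a sketch:
```
n_layer, n_kv_heads, head_dim = 48, 8, 256   # assumed gemma-3-12b config
n_ctx, bytes_fp16 = 32768, 2                 # 32k context, fp16

# K + V per token per layer
bytes_per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_fp16

full_cache = n_layer * n_ctx * bytes_per_token_per_layer
print(full_cache / 2**20)      # 12288.0 MB -> the original full-context cache

n_global = 8                   # assuming 1 global layer per 6
global_cache = n_global * n_ctx * bytes_per_token_per_layer
print(global_cache / 2**20)    # 2048.0 MB -> the global-attention part above
```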
https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/src/transformers/cache_utils.py
In transformers' implementation, the __init__ function of the HybridCache class allocates the KV cache this way for the global and local layers:
```
global_cache_shape = (self.max_batch_size, self.num_key_value_heads, max_cache_len, self.head_dim)
sliding_cache_shape = ...
```
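For contrast, here is a small sketch of how I read the two shapes: the global layers get the full max_cache_len on the sequence axis, while the sliding layers are capped at the window size. The cap via min(sliding_window, max_cache_len) and the concrete numbers are my assumption, not a quote from cache_utils.py:
```
max_batch_size, num_key_value_heads, head_dim = 1, 8, 256
max_cache_len, sliding_window = 32768, 1024   # assumed values for illustration

global_cache_shape = (max_batch_size, num_key_value_heads, max_cache_len, head_dim)
sliding_cache_shape = (max_batch_size, num_key_value_heads,
                       min(sliding_window, max_cache_len), head_dim)

print(global_cache_shape)   # (1, 8, 32768, 256)
print(sliding_cache_shape)  # (1, 8, 1024, 256)
```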
Thanks for your clarification. So the formula for the local attention KV cache is 2*40*8*256*PAD(n_swa*n_seq_max + n_batch + 1, 32)/(1024*1024)? Empirically, I see that for batch_size 1 to 95, the local attention...
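As a quick sanity check against the 970MB figure above, plugging in n_swa=1024, n_seq_max=1 and the default n_batch=2048 (all assumptions on my part), and counting both K and V at 2 bytes each for fp16, which adds a factor of 2 over the formula as I wrote it:
```
def pad(n, align):
    # round n up to a multiple of align
    return (n + align - 1) // align * align

n_local_layers, n_kv_heads, head_dim = 40, 8, 256
n_swa, n_seq_max, n_batch = 1024, 1, 2048   # assumed window size and defaults

n_cells = pad(n_swa * n_seq_max + n_batch + 1, 32)
local_mb = 2 * 2 * n_local_layers * n_kv_heads * head_dim * n_cells / (1024 * 1024)
#          ^ K and V, 2 bytes per fp16 element
print(n_cells, local_mb)    # 3104 970.0
```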
https://raw.githubusercontent.com/ollama/ollama/refs/heads/main/kvcache/causal.go
In the Init function of the Causal struct, ollama seems to have an implementation similar to ggerganov's:
```
var cacheSize int
if c.windowSize == math.MaxInt32 || capacity < int(c.windowSize) {
    ...
```
I see this warning when I set batch_size to less than 64:
```
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
```
So this explains why there is a higher...
I am also observing degradation in both pp and tg at high batch sizes for the 12b model.
```
./build/bin/llama-bench -m ~/gguf/google_gemma-3-12b-it-Q4_K_M.gguf -n 32 -d 8192 -b
./build/bin/llama-bench -m ~/gguf/google_gemma-3-12b-it-Q4_K_M.gguf ...
```