Nicolas Patry
You send a request:

```
input_ids = [A, B, C, A]  # Those are tokens
new_token_D, past = forward(input_ids)
```

That's a prefill step. Then we continue generating new tokens in...
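Here is a minimal sketch of that prefill-then-decode pattern using the plain `transformers` API (the `gpt2` model id and greedy decoding are just illustrative assumptions, not what the server actually does):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Hello", return_tensors="pt").input_ids

# Prefill: run the whole prompt once and keep the KV cache (`past`).
out = model(input_ids, use_cache=True)
past = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decode: each step feeds only the single new token, reusing the cache.
for _ in range(10):
    out = model(next_token, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
```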
> I understand that flash or sparse attention models won't use padding, but if the user generates a very long sequence, say thousands of tokens, how many of those tokens...
@OlivierDehaene Can we merge this?
Try disabling it? It should still download the model, just a bit more slowly. `hf_transfer` is really barebones, and any flaky network might trigger issues for you (or because you're...
Isn't there a way for you to provide environment variables?

```
HF_HUB_ENABLE_HF_TRANSFER=0
```

is what you are looking for.
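If setting it in the shell is awkward, here is a small Python sketch of the same idea (the repo id is only an example); the variable has to be set before the hub helpers are imported:

```python
import os

# Disable hf_transfer so downloads fall back to the plain Python path.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

from huggingface_hub import snapshot_download

snapshot_download("gpt2")  # example repo id
```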
Thanks for sharing the solution! Closing this then.
It should work, but you would need the `--trust-remote-code` flag for it to work. Can you provide a full stacktrace?
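For reference, the rough `transformers`-level equivalent of that flag is the `trust_remote_code` argument (the model id below is a hypothetical repo that ships custom modeling code):

```python
from transformers import AutoModelForCausalLM

# Custom modeling code from the hub is only executed with this explicit opt-in.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/custom-model",  # hypothetical repo id
    trust_remote_code=True,
)
```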
I think the first one is a very easy fix we could implement. Today there are 3 issues about this conversion, so maybe making it a bit more robust/effective in...
Hi @mayurtikundi12 You need to work with the latest version for this model to work. We're going to release 0.9 soon, which should work. @OlivierDehaene (for visibility)
Try with `--auto-convert false`. This error happens when trying to convert to safetensors, but it shouldn't be required for non-*core* models.