Nicolas Patry
I think you need to pass your token, since this model is gated behind an access request (essentially, you cannot download it without being logged in). ``` docker run...
> Ah ok, thanks @Narsil, and the authentication is required only for the download, right? Yes > Is it for every model? How can I know in advance...
The `-e X=Y` flag needs to come before the docker image `ghcr:....`; this is how docker-cli works. (Currently the `-e` is passed directly to `text-generation-launcher`, which indeed doesn't have this flag.)
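To make the ordering concrete, here is a minimal sketch that assembles a `docker run` argument list with the `-e` flags placed before the image. The helper name `docker_run_command` and the placeholders `IMAGE` and `MODEL` are assumptions for illustration only, and `--model-id` stands in for whatever launcher arguments you actually pass.

```python
import shlex


def docker_run_command(image, env, launcher_args):
    """Build a `docker run` argv (hypothetical helper, for illustration).

    Everything before the image name is parsed by docker itself;
    everything after it is handed to the container's entrypoint.
    """
    cmd = ["docker", "run"]
    for key, value in env.items():
        # Docker options such as -e must precede the image name.
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    # Arguments after the image go to the entrypoint, not to docker.
    cmd += list(launcher_args)
    return cmd


# IMAGE and MODEL are placeholders, not real names.
cmd = docker_run_command("IMAGE", {"X": "Y"}, ["--model-id", "MODEL"])
print(shlex.join(cmd))
```

If the `-e X=Y` pair were appended after `IMAGE` instead, docker would forward it to the entrypoint as an ordinary argument, which is the situation described above.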
We just landed a massive rework that should make enabling GPTQ much easier; now only `safetensors` files are read and used.
> Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many including myself don't have access to a GPU with...
> Is the act-order option supported? It exists in the code, as everything is simply pulled, but it is not exposed yet. Any good info on what act-order does, and its implications? (If...
> @Narsil For a higher speedup of LLaMA models, you can check out the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation...
Try using the options here: https://huggingface.co/docs/accelerate/usage_guides/big_modeling Notably `device_map = infer_auto_device_map(my_model, max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"})` seems like a good option to reserve enough memory on GPU0 (you could say 0...
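A minimal sketch of how the `max_memory` budget could be assembled and handed to `infer_auto_device_map`. The helper names `build_max_memory` and `plan_device_map` are hypothetical, and the budgets are just the example values quoted above, not recommendations.

```python
def build_max_memory(gpu_budgets, cpu_budget):
    """Return a max_memory dict: GPU index -> budget, plus a "cpu" entry.

    Hypothetical helper; the dict shape matches the example in the comment.
    """
    max_memory = {index: budget for index, budget in enumerate(gpu_budgets)}
    max_memory["cpu"] = cpu_budget
    return max_memory


def plan_device_map(model, max_memory):
    """Ask accelerate for a placement that respects the given budgets."""
    # Imported lazily so build_max_memory works without accelerate installed.
    from accelerate import infer_auto_device_map

    return infer_auto_device_map(model, max_memory=max_memory)


max_memory = build_max_memory(["10GiB", "10GiB"], "30GiB")
print(max_memory)
```

Lowering the entry for one GPU leaves more headroom on that device at the cost of pushing more layers onto the other devices or the CPU.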
That means the layers weren't loaded at all, probably disk offloaded. I'm not familiar enough with accelerate internals, but there must be some way to fetch the information of where...
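One possible way to check placement, offered as a sketch rather than a confirmed answer: models dispatched with a device map usually carry it as `model.hf_device_map`, a dict from module name to device. The helper `summarize_device_map` and the example map below are made up for illustration.

```python
from collections import defaultdict


def summarize_device_map(device_map):
    """Group module names by the device they were assigned to.

    Hypothetical helper; `device_map` is assumed to look like the
    `hf_device_map` dict (module name -> GPU index, "cpu" or "disk").
    """
    by_device = defaultdict(list)
    for module_name, device in device_map.items():
        by_device[str(device)].append(module_name)
    return dict(by_device)


# Made-up example; with a real model you would pass model.hf_device_map.
example_map = {"embed": 0, "layers.0": 0, "layers.1": "cpu", "lm_head": "disk"}
summary = summarize_device_map(example_map)
print(summary.get("disk", []))
```

Any module listed under `"disk"` was offloaded rather than loaded onto a GPU or into CPU memory.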
@psinger Illegal access seems like a triton bug. Which GPU are you using? I'm guessing that if it's old, triton might be creating invalid kernels. @0x1997 Absence of `g_idx` means...