Nicolas Patry comments

Results 978 comments of Nicolas Patry

"Unauthorized for url: https://huggingface.co/api/models/bigcode/starcoder"

I think you need to pass your token, since this model is gated behind an access request (essentially, you cannot download it without being logged in). ``` docker run...
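As a rough illustration of why the anonymous call is rejected, the sketch below builds (but does not send) a request to the API URL from the error message, with and without a token. The helper name `build_model_info_request` and the token value are hypothetical; the `Authorization: Bearer ...` header is the usual way to present a Hugging Face token over HTTP.

```python
import urllib.request


def build_model_info_request(model_id, token=None):
    """Build a request for the Hub model-info endpoint (nothing is sent here).

    `token` is a placeholder for a user access token; without it the request
    is anonymous, which is the case that fails for gated models.
    """
    url = f"https://huggingface.co/api/models/{model_id}"
    request = urllib.request.Request(url)
    if token:
        # Authenticated requests carry the token as a Bearer credential.
        request.add_header("Authorization", f"Bearer {token}")
    return request


anonymous = build_model_info_request("bigcode/starcoder")
authenticated = build_model_info_request("bigcode/starcoder", token="hf_example")
```

Passing either object to `urllib.request.urlopen` would perform the actual call; only the authenticated one can succeed for a gated model, and only once access has been granted to that account.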

"Unauthorized for url: https://huggingface.co/api/models/bigcode/starcoder"

> Ah ok, thanks @Narsil, and the authentication is required only for the download, right?
Yes.
> Is it for every model? How can I know in advance...

"Unauthorized for url: https://huggingface.co/api/models/bigcode/starcoder"

The `-e X=Y` needs to come before the docker image `ghcr:....` This is how the docker CLI works: everything after the image name is passed to the container's command, so currently the `-e` is sent directly to `text-generation-launcher`, which indeed doesn't have this flag.
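The ordering rule can be sketched as a small argument builder. This is only an illustration: `build_docker_run` is a hypothetical helper, and `IMAGE` stands in for the real image reference, which is elided above.

```python
def build_docker_run(image, env=None, docker_options=(), launcher_args=()):
    """Assemble a `docker run` argument list in the order the docker CLI expects.

    Everything placed before `image` is parsed by docker itself; everything
    after it is handed to the container's command (here, the launcher).
    """
    command = ["docker", "run"]
    for name, value in (env or {}).items():
        command += ["-e", f"{name}={value}"]  # docker option: must precede the image
    command += list(docker_options)
    command.append(image)
    command += list(launcher_args)  # seen only by the launcher inside the container
    return command


command = build_docker_run(
    "IMAGE",  # placeholder, not a real image name
    env={"X": "Y"},
    launcher_args=["--model-id", "bigcode/starcoder"],
)
```

The resulting list can be passed to `subprocess.run(command)`; keeping it as a list also avoids shell quoting issues.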

FlashLlama doesn't look for safetensors files

We just landed a massive rework, which should make enabling GPTQ much easier; now only `safetensors` files are read and used.

Inference support for GPTQ (llama + falcon tested) + Quantization script

> Wondering when this PR will be merged and whether you will be uploading a falcon-40b-instruct-gptq as well? I think many including myself don't have access to a GPU with...

Inference support for GPTQ (llama + falcon tested) + Quantization script

> Is the act-order option supported?
It exists in the code, as everything is simply pulled in, but it is not exposed yet. Any good info on what act-order does, and its implications? (If...

Inference support for GPTQ (llama + falcon tested) + Quantization script

> @Narsil For a higher speedup of LLaMA models, you can check out the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation...

Inference support for GPTQ (llama + falcon tested) + Quantization script

Try using the options here: https://huggingface.co/docs/accelerate/usage_guides/big_modeling Notably, `device_map = infer_auto_device_map(my_model, max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"})` seems like a good option to reserve enough memory on GPU0 (you could say 0...
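A minimal sketch of that suggestion, assuming `accelerate` is installed and that the model object is already built. The helper names `build_max_memory` and `plan_device_map` are hypothetical; only `infer_auto_device_map` and its `max_memory` argument come from the comment above.

```python
def build_max_memory(gpu_gib, cpu_gib):
    """Return a max_memory mapping: one entry per GPU index plus a "cpu" entry."""
    max_memory = {index: f"{gib}GiB" for index, gib in enumerate(gpu_gib)}
    max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


def plan_device_map(model, gpu_gib, cpu_gib):
    """Ask accelerate where each module should live under the given caps."""
    # Imported lazily so the budget helper above works without accelerate.
    from accelerate import infer_auto_device_map

    return infer_auto_device_map(model, max_memory=build_max_memory(gpu_gib, cpu_gib))


budget = build_max_memory([10, 10], 30)
```

Lowering the cap for a given GPU below its physical memory leaves headroom on that device for whatever else needs to run there.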

Inference support for GPTQ (llama + falcon tested) + Quantization script

That means the layers weren't loaded at all; they were probably offloaded to disk. I'm not familiar enough with accelerate internals, but there must be some way to fetch the information of where...

Inference support for GPTQ (llama + falcon tested) + Quantization script

@psinger Illegal access seems like a Triton bug. Which GPU are you using? I'm guessing that if it's old, Triton might be creating invalid kernels.
@0x1997 Absence of `g_idx` means...