matmulfreellm LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32

When I run with GPT-2 models, everything works. But when I run any of the models ridger/MMfreeLM-370M, MMfreeLM-1.3B, or MMfreeLM-2.7B, this error occurs. Why? Can anyone help me?

Error:
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
[1] 93105 IOT instruction python3 generate_text.py

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import mmfreelm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Name of the pretrained model
#name = 'ridger/MMfreeLM-370M'
name = 'ridger/MMfreeLM-1.3B'
#name = 'ridger/MMfreeLM-2.7B'
#name = 'openai-community/gpt2'

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

# input_prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, "
# input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.cuda()
# outputs = model.generate(input_ids, max_length=32,  do_sample=True, top_p=0.4, temperature=0.6)
# print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs.input_ids.cuda()
    attention_mask = inputs.attention_mask.cuda()
    outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=32, do_sample=True, top_p=0.4, temperature=0.6, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

while True:
    prompt = input("You: ")
    if prompt.lower() in ['exit', 'quit']:
        break
    response = generate_response(prompt)
    print(f"Model: {response}")

Jul 03 '24 23:07 jeisonmp

Does anyone know if this runs in WSL 2 on Windows 10?

Jul 05 '24 13:07 jeisonmp

What NVIDIA GPU are you using?

Jul 11 '24 00:07 nevercast

Hi @nevercast! It is an NVIDIA GeForce GTX 1050.

Jul 17 '24 18:07 jeisonmp

Hi!

My immediate assumption is that the GTX 1050 does not have a high enough compute capability to support this kernel function. I can validate this later for you; I haven't had coffee yet.

If that is the case though, you might want to consider trying to run this project on a free Google Colab notebook with a T4 GPU attached.

Jul 17 '24 20:07 nevercast

The GTX 1050 has compute capability 6.1. Triton (which I believe this project uses) has dropped support for anything below 7.0.

You may be able to get a build to work, but you'd be on your own and going against the grain, so to speak.
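If you want to check this on your own machine, here is a minimal sketch. It assumes PyTorch is installed, and the 7.0 threshold is simply the figure quoted in this thread, not something read from Triton itself. The names `MIN_CAPABILITY`, `meets_minimum` and `describe_gpu` are made up for illustration.

```python
# Threshold taken from the discussion in this thread (an assumption,
# not a value queried from Triton).
MIN_CAPABILITY = (7, 0)


def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability is >= minimum."""
    return tuple(capability) >= tuple(minimum)


def describe_gpu():
    """Report the first CUDA device and whether it meets the threshold."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch"
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)  # e.g. (6, 1)
    verdict = "meets" if meets_minimum(capability) else "is below"
    return (
        f"{name}: compute capability {capability[0]}.{capability[1]} "
        f"{verdict} the assumed minimum of "
        f"{MIN_CAPABILITY[0]}.{MIN_CAPABILITY[1]}"
    )


if __name__ == "__main__":
    print(describe_gpu())
```

On a GTX 1050 this should report a capability of 6.1, which is below the threshold. A Colab T4 is 7.5, so it should pass the same check.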

Jul 18 '24 11:07 nevercast