support for https://huggingface.co/nvidia/Nemotron-4-340B-Instruct ?
can you add support for this bad boy: https://huggingface.co/nvidia/Nemotron-4-340B-Instruct ?
from airllm import AutoModel
import torch

MAX_LENGTH = 15

# could use a Hugging Face model repo id:
model = AutoModel.from_pretrained("unsloth/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit", delete_original=True)

input_text = ['What is the capital of United States?']

input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=2,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)

Running this on a Colab TPU runtime fails with:

AssertionError: Torch not compiled with CUDA enabled
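The assertion comes from the hard-coded .cuda() call: a Colab TPU runtime has no CUDA device, so the PyTorch build there cannot move tensors to a GPU. A minimal sketch of a device guard for the generate step, assuming you either switch Colab to a GPU runtime or accept very slow CPU execution (AirLLM's layer-by-layer offloading is built around CUDA GPUs, so this only fixes the tensor placement, not TPU support):

import torch

# Pick a device that actually exists in this runtime; hard-coding
# .cuda() raises "Torch not compiled with CUDA enabled" on TPU/CPU-only
# Colab runtimes.
device = "cuda" if torch.cuda.is_available() else "cpu"

generation_output = model.generate(
    input_tokens['input_ids'].to(device),
    max_new_tokens=2,
    use_cache=True,
    return_dict_in_generate=True)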
I have an RTX 3090. Any idea how much disk space would be required to run Nemotron? Also, how can I load the model from a different directory?
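For rough sizing: Nemotron-4-340B has 340B parameters, so the fp16 weights alone are about 340e9 x 2 bytes, roughly 680 GB, before AirLLM's per-layer split copies; the 4-bit 70B variant used in the snippet above is closer to 70e9 x 0.5 bytes, around 35-40 GB. For the directory question, a sketch assuming the layer_shards_saving_path argument documented in the AirLLM README; the /mnt/big_disk paths are placeholders:

import os

# Point the Hugging Face download cache at a bigger disk (assumption:
# this must be set before airllm/transformers are imported, since the
# cache location is resolved at import time in some versions).
os.environ["HF_HOME"] = "/mnt/big_disk/hf_cache"  # placeholder path

from airllm import AutoModel

# layer_shards_saving_path controls where AirLLM writes its
# layer-by-layer split of the weights, so it can live on a different
# drive than the HF cache.
model = AutoModel.from_pretrained(
    "unsloth/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit",
    layer_shards_saving_path="/mnt/big_disk/airllm",  # placeholder path
    delete_original=True)  # drop the original shards after splitting to save disk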