No output from models at step 1 and step 3
Hi, I tried using your DeepSpeed-Chat example to train a facebook/opt-1.3b model with RLHF.
I'm using a custom dataset of 500 examples. I updated the data_utils.py and raw_datasets.py files to add my own custom dataset to the pipelines.
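For reference, the class I added to raw_datasets.py follows the pattern of the bundled datasets such as DahoasRmstaticDataset. A simplified sketch (my real dataset and column names differ, and the exact PromptRawDataset signature may vary slightly between versions):

from datasets import load_from_disk

# Lives in raw_datasets.py next to the PromptRawDataset base class.
# The dataset path and column names below are illustrative.
class MyCustomDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank):
        super().__init__(output_path, seed, local_rank)
        self.dataset_name = "my_custom_dataset"
        self.dataset_name_clean = "my_custom_dataset"
        self.raw_datasets = load_from_disk("/home/chainyo/data/training_dataset")

    def get_train_data(self):
        return self.raw_datasets["train"]

    def get_eval_data(self):
        return self.raw_datasets["test"]

    # DeepSpeed-Chat expects the " Human: ... Assistant:" framing here
    def get_prompt(self, sample):
        return " Human: " + sample["prompt"] + " Assistant:"

    def get_chosen(self, sample):
        return " " + sample["chosen"]

    def get_rejected(self, sample):
        return " " + sample["rejected"]

    def get_prompt_and_chosen(self, sample):
        return " Human: " + sample["prompt"] + " Assistant: " + sample["chosen"]

    def get_prompt_and_rejected(self, sample):
        return " Human: " + sample["prompt"] + " Assistant: " + sample["rejected"]

The change to data_utils.py is a matching branch for "my_custom_dataset" in get_raw_dataset so the training scripts can pick it up.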
After training with the updated scripts, the models don't output anything when I try to use them.
Could it be because the pipelines only work with an embedding size of 512? I'm using 2048 because I have long text inputs.
Can someone help me fix the pipelines so I get relevant output with inputs larger than 512 tokens?
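To rule out truncation, here is a quick way to compare the tokenized prompt lengths against the max_seq_len used at training time (a sketch using the stock OPT tokenizer and my dataset path):

from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
data = load_from_disk("/home/chainyo/data/training_dataset")["test"]

# Any prompt longer than the max_seq_len passed to the training
# scripts would have been truncated during fine-tuning.
lengths = [len(tokenizer(sample["prompt"]).input_ids) for sample in data]
print("max:", max(lengths), "mean:", sum(lengths) / len(lengths))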
Here is the evaluation snippet I use to see how well the fine-tuned models perform:
from datasets import load_from_disk
from transformers import pipeline

model_step1 = "/home/chainyo/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/output"
model_step3 = "/home/chainyo/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/output/actor"

# I swap in model_step1 or model_step3 to compare the two checkpoints
generator = pipeline("text-generation", model=model_step3, tokenizer="facebook/opt-1.3b", device=0)

data = load_from_disk("/home/chainyo/data/training_dataset")
data = data["test"]

# I pick one sample for testing
prompt = data[0]["prompt"]
prompt = f"Human: {prompt}\nAssistant:\n"

res = generator(
    prompt,
    max_length=2048,
    do_sample=True,  # without this, temperature/top_k/top_p are silently ignored
    num_beams=4,
    temperature=0.9,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.0,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)
print(res[0]["generated_text"])
>>> (prints the prompt with "Assistant:" appended, nothing else)
The result is just the original prompt with the "Assistant:" keyword appended; the model generates no new text.
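For what it's worth, a quick way to confirm the model truly emits zero new tokens (rather than the pipeline merely echoing the prompt) is to call generate directly and decode only the continuation. A sketch, reusing the step-3 actor path from above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/chainyo/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/output/actor"
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")

prompt = "Human: <one of my test prompts>\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)

# Slice off the prompt tokens so only the continuation is decoded;
# if this prints 0, the model really generates nothing new.
new_tokens = output[0][inputs.input_ids.shape[1]:]
print(len(new_tokens))
print(tokenizer.decode(new_tokens, skip_special_tokens=True))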
Same issue ^
@chainyo @aleksandr-smechov Can you first try with the datasets we used in the example scripts? If using those datasets also leads to this problem, then it looks like a bug. If not, the problem is with your own dataset and you would need to fix it yourselves.
Disclaimer: what DeepSpeed-Chat provides is a framework for efficiently training any ChatGPT-like model, plus a set of example scripts that we have tested both for system performance and for whether the resulting model produces meaningful responses with the specific pretrained models/datasets/hyperparameters. However, we do not provide guidance on which pretrained models/datasets/hyperparameters to use in order to train the best ChatGPT-like model. Such guidance is impossible to give in a rapidly evolving field where new models, datasets, and techniques appear every day. Users need to explore on their own how to train a better model for their situations.
Are you suggesting I train with the default settings? What would be the point of that?
Or are you suggesting I run the three steps with the default dataset and models, but with an embedding size of 2048?
@chainyo Based on your original description, you have changed quite a few things from the original example we tested: dataset, embedding size, etc. We can't guarantee that everything will work out of the box in those situations. What I was suggesting is that, if you haven't done so already, you first try our original example without changing anything, just to check whether you get meaningful output in that case and rule out any potential bug in our framework.
Closing due to lack of activity; feel free to reopen or create a new issue when you have more info to share.