
The reward in step3 seems to be completely random without any noticeable increase.

Open · laoda513 opened this issue 2 years ago · 10 comments

I am testing the 1.3B training. Steps 1 and 2 have already passed, but there is no change in reward after completing step 3.

I used LoRA to train for one iteration, and the results of steps 1 and 2 are as follows:

step 1: ppl: 2.18959641456604

step 2: (screenshot)

Step 3: (screenshot)

I had ChatGPT extract the logs for step 3 and compare them with the demo logs provided in the project. I found that the absolute value of my loss is significantly smaller, and the reward seems to be completely random without any noticeable increase.

(screenshots of the step 3 logs)

laoda513 avatar May 07 '23 15:05 laoda513

My rewards even seem to be decreasing, despite the decrease in loss. (Three W&B chart screenshots, 07/05/2023.)

puyuanOT avatar May 07 '23 23:05 puyuanOT

@puyuanOT OK, I got the solution: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure of the reason.
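For reference, a minimal sketch of what that workaround looks like, assuming the actor's DeepSpeed config carries a hybrid_engine block like the one in the repro script later in this thread (the exact flag or key exposed by your launch script may differ):

# Hypothetical sketch: in the actor's DeepSpeed config, turn the hybrid engine off
# so that step 3 generation falls back to the ordinary ZeRO-3 forward path.
ds_config['hybrid_engine'] = {'enabled': False}

With HE off, generation runs on the standard (slower) ZeRO path, which sidesteps the repeated 'a a a a a' output described above.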

laoda513 avatar May 08 '23 15:05 laoda513

@puyuanOT OK, I got the solution: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure of the reason.

Thanks a lot! Will try it out.

puyuanOT avatar May 08 '23 17:05 puyuanOT

Perhaps it's related to this PR https://github.com/microsoft/DeepSpeedExamples/pull/470?

puyuanOT avatar May 08 '23 17:05 puyuanOT

That's another bug, I think.

laoda513 avatar May 09 '23 02:05 laoda513

@puyuanOT OK, I got the solution: try disabling the hybrid engine. With it enabled, the model always repeats 'a a a a a'; I'm not sure of the reason.

I also hit this problem and have no idea why it is happening...

REIGN12 avatar May 09 '23 03:05 REIGN12

I opened a new issue to track this: #503

laoda513 avatar May 09 '23 04:05 laoda513

Thank you for letting us know. We are now investigating whether HE has any unexpected behavior.

yaozhewei avatar May 19 '23 15:05 yaozhewei

Thank you for letting us know. We are now investigating whether HE has any unexpected behavior.

@yaozhewei I also encountered the same issue with deepspeed==0.9.0 and deepspeed==0.9.1. It can be reproduced with a very simple script; I hope it helps :) If there is any progress, could you please let me know?

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import deepspeed

model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b", use_fast=False)
tokenizer.padding_side = 'left'

# ZeRO stage 3 with the hybrid engine enabled and inference_tp_size > 1,
# which is the combination that reproduces the broken generations.
ds_config = {
    'train_micro_batch_size_per_gpu': 4,
    'steps_per_print': 10,
    'zero_optimization': {'stage': 3,
                          'offload_param': {'device': 'none'},
                          'offload_optimizer': {'device': 'none'},
                          'stage3_param_persistence_threshold': 10000.0,
                          'stage3_max_live_parameters': 30000000.0,
                          'stage3_prefetch_bucket_size': 30000000.0,
                          'memory_efficient_linear': False},
    'fp16': {'enabled': True, 'loss_scale_window': 100},
    'gradient_clipping': 1.0,
    'prescale_gradients': False,
    'wall_clock_breakdown': False,
    'hybrid_engine': {'enabled': True, 'inference_tp_size': 8,
                      'release_inference_cache': False, 'pin_parameters': True,
                      'tp_gather_partition_size': 8}}
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.eval()

sent = ["Human: List five action models\n\nAssistant: ", "Human: hello\n\nAssistant: "]
inputs = tokenizer(sent, padding=True, return_tensors='pt')
inputs = inputs.to(model.device)
gen_kwargs = {"max_length": 512}
output = engine.module.generate(inputs["input_ids"], **gen_kwargs)
torch.cuda.synchronize()
for o in output:
    response = tokenizer.decode(o)
    print(response)

This script uses the opt-6.7b model to generate completions. When I turn off HE, or turn it on with an inference_tp_size of 1, the results match my expectations. However, if I turn on HE with an inference_tp_size greater than 1 (such as 2 or 8), the generated output is just "((((", as shown in the screenshot below.
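In terms of the ds_config in the script above, the reported behavior corresponds roughly to the following variants; this is a sketch of the observations, with only the hybrid_engine fields changed:

# Works: hybrid engine disabled entirely.
ds_config['hybrid_engine']['enabled'] = False

# Works: hybrid engine enabled, but without tensor parallelism.
ds_config['hybrid_engine'].update({'enabled': True, 'inference_tp_size': 1})

# Broken output ("(((("): hybrid engine enabled with inference_tp_size > 1, e.g. 2 or 8.
ds_config['hybrid_engine'].update({'enabled': True, 'inference_tp_size': 8})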

This is the testing environment I used.

transformers==4.30.0.dev0
deepspeed==0.9.0

beichengus avatar Jun 01 '23 02:06 beichengus

@yaozhewei Same error when training Llama: steps 1 and 2 are normal, but step 3 just won't converge.

AlisonWen avatar Nov 20 '23 00:11 AlisonWen