
[BUG] model.eval() doesn't work with DeepSpeed Transformer Kernel

Open hezq06 opened this issue 2 years ago • 2 comments

Describe the bug Calling model.eval() in the usual way doesn't seem to work with the DeepSpeed Transformer Kernel. The module's training flag is changed, but the randomness (dropout) is still applied.

To Reproduce I've made a simple example python file to reproduce the problem. DSTrfKernel_issue.txt

To reproduce, change .txt to .py, and simply run: python DSTrfKernel_issue.py in an environment with python, torch and deepspeed.
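For reference, here is a minimal sketch of the kind of check the attached script performs. It is not a copy of the attachment; the layer construction, tensor shapes, and forward-call signature are assumptions.

import torch
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

# Kernel config similar to the one discussed below; training=True enables dropout
# inside the fused kernel.
config = DeepSpeedTransformerConfig(
    batch_size=16, hidden_size=32, intermediate_size=64, heads=2,
    attn_dropout_ratio=0.1, hidden_dropout_ratio=0.1, num_hidden_layers=2,
    initializer_range=0.02, local_rank=-1, seed=1234, fp16=False,
    pre_layer_norm=True, training=True)

layer = DeepSpeedTransformerLayer(config).cuda()
layer.eval()  # expectation: dropout is disabled from here on

hidden = torch.randn(16, 8, 32, device="cuda")  # (batch, seq, hidden) -- assumed shape
mask = torch.zeros(16, 1, 1, 8, device="cuda")  # additive attention mask -- assumed shape

with torch.no_grad():
    out1 = layer(hidden, mask)
    out2 = layer(hidden, mask)

# If eval() were respected, the two passes would be identical; with the kernel
# still in training mode they differ because dropout is still applied.
print("outputs identical:", torch.allclose(out1, out2))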

Expected behavior The tensors should not change between the two forward passes, but they actually change.

ds_report output Please run ds_report to give us details about your setup.

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/hezq17/anaconda3/envs/dynmlm/lib/python3.7/site-packages/torch']
torch version .................... 1.8.0
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/home/hezq17/anaconda3/envs/dynmlm/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1

System info (please complete the following information):

  • OS: CentOS7
  • GPU count and types: Nvidia V100S
  • Python version: 3.7.13

Launcher context python

Docker context N/A

hezq06 avatar Jul 28 '22 10:07 hezq06

Hi @hezq06,

Can you please update your DeepSpeed to the master branch and see if the issue persists? Thanks, Reza

RezaYazdaniAminabadi avatar Aug 29 '22 17:08 RezaYazdaniAminabadi

I've tried my test script on the current master version of DeepSpeed (0.7.3+aca34a9). Unfortunately, the test doesn't pass and the bug still persists.

hezq06 avatar Aug 30 '22 05:08 hezq06

Hi @hezq06,

In the script you pointed to, I see that the training flag is set to true in ds_config:

dsconfig = DeepSpeedTransformerConfig(
    batch_size=16,
    hidden_size=32,
    intermediate_size=64,
    heads=2,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=2,
    initializer_range=0.02,
    local_rank=-1,
    seed=1234,
    fp16=False,
    pre_layer_norm=True,
    attn_dropout_checkpoint=False,
    normalize_invertible=False,
    gelu_checkpoint=False,
    stochastic_mode=False,
    training=True
)

Even though the training flag of torch.nn.Module is set to false after the eval() call, the ds_config training flag is what DeepSpeed uses during the forward call. Setting the training flag to false in ds_config gives the correct output. Please confirm this.
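A quick sketch of that workaround, reusing the config values above. Whether the flag can also be flipped on an already-built layer depends on internals, so that part is only a guess.

from deepspeed.ops.transformer import DeepSpeedTransformerConfig

# Build the kernel config with training=False for evaluation, since this is the
# flag the fused kernel consults at forward time rather than nn.Module.training.
eval_config = DeepSpeedTransformerConfig(
    batch_size=16, hidden_size=32, intermediate_size=64, heads=2,
    attn_dropout_ratio=0.1, hidden_dropout_ratio=0.1, num_hidden_layers=2,
    initializer_range=0.02, local_rank=-1, seed=1234, fp16=False,
    pre_layer_norm=True, training=False)

# If the layer exposes its config object (an assumption about internals),
# toggling the flag in place may also work:
# layer.config.training = False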

lokoppakmsft avatar Dec 08 '22 21:12 lokoppakmsft