
[Nano] Enabling both ipex 1.11 and bf16 raises AttributeError

Open aixideng opened this issue 1 year ago • 4 comments

Description

Running Trainer.fit with both ipex 1.11 and bf16 enabled produces the error below:

[screenshot: error message]

Environment

Python=3.7.13
torch=1.11.0
pytorch_lightning=1.6.4
ipex=1.11.0
bigdl-nano=2.1.0b20220801
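
For context, the training is launched roughly like this (a minimal sketch: enable_bf16 is the option discussed later in this thread, while use_ipex, model, train_loader and val_loader are placeholder names, not necessarily the exact code in train.py):

```python
from bigdl.nano.pytorch import Trainer

# Minimal sketch of the failing configuration; the exact flag names and
# the model/dataloader objects are assumptions for illustration.
trainer = Trainer(max_epochs=1, use_ipex=True, enable_bf16=True)
trainer.fit(model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)
```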

aixideng · Aug 02 '22 13:08

In bigdl-nano 2.1.0b20220802, it raises a different error instead:

Traceback (most recent call last):
  File "train.py", line 146, in <module>
    main(args)
  File "train.py", line 98, in main
    val_dataloaders=val_loader)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 207, in advance
    self.optimizer_idx,
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1646, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/opt/workspace/dax/anaconda3/envs/recsys/lib/python3.7/site-packages/intel_extension_for_pytorch/optim/_optimizer_utils.py", line 54, in master_param_non_fused_step
    k.grad = value['bf16_param'].grad.detach().float()
AttributeError: 'NoneType' object has no attribute 'detach'
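
The step apparently fails because IPEX 1.11's master-weight update assumes every bf16 parameter has a populated .grad, while a parameter that received no gradient in the backward pass (for example, an unused or frozen layer) leaves .grad as None. The sketch below mirrors the failing line with an illustrative guard; the variable names are placeholders and this is not IPEX's actual code or fix:

```python
# Illustrative rewrite of the loop in master_param_non_fused_step
# (intel_extension_for_pytorch 1.11, _optimizer_utils.py).
for k, value in params_attr.items():        # fp32 master param -> attr dict (names assumed)
    bf16_grad = value['bf16_param'].grad
    if bf16_grad is None:                   # parameter got no gradient this step
        continue                            # skip instead of calling .detach() on None
    k.grad = bf16_grad.detach().float()     # copy the grad back to the fp32 master weight
```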

aixideng · Aug 03 '22 10:08

By the way, in my experiments the test accuracy is identical whether enable_bf16 is True or False, which makes me wonder whether this option actually takes effect.

enable_bf16 | Fit Time | Test Accuracy | Test Loss
----------- | -------- | ------------- | ---------
False       | 4802.48s | 55.5208%      | 1.1799
True        | 4779.57s | 55.5208%      | 1.1799
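
One ad-hoc way to check whether bf16 actually kicked in is to inspect the parameter dtypes after training. This assumes IPEX converts the module weights in place when bf16 is enabled; it is a probe for debugging, not a Nano API:

```python
import torch

# If IPEX bf16 training is active, the trainable weights should have been
# cast to bfloat16 (fp32 master copies live in the optimizer state).
n_bf16 = sum(1 for p in model.parameters() if p.dtype == torch.bfloat16)
n_total = sum(1 for _ in model.parameters())
print(f"{n_bf16}/{n_total} parameters are bfloat16")
```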

The model architecture is as follows:

  | Name          | Type            | Params
--------------------------------------------------
0 | cross_encoder | XLMRobertaModel | 117 M
1 | fc_layer      | Sequential      | 1.5 K
--------------------------------------------------
117 M     Trainable params
0         Non-trainable params
117 M     Total params
470.569   Total estimated model params size (MB)

aixideng · Aug 03 '22 12:08

I tried enabling both ipex and bf16, but got a different error message: "the input and weight need have same data type". Can you share your code so that I can reproduce this issue?
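
For reference, that dtype mismatch is what IPEX typically reports when a bf16-converted module is fed fp32 inputs outside an autocast region. A minimal plain-IPEX sketch (independent of Nano, shown only to illustrate the failure mode) that avoids it:

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Cast the module to bf16; fp32 master weights are kept for the optimizer.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x = torch.randn(8, 16)                               # fp32 input
with torch.cpu.amp.autocast(dtype=torch.bfloat16):   # match input dtype to the bf16 weights
    loss = model(x).sum()
loss.backward()
optimizer.step()
```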

y199387 · Aug 04 '22 02:08

> I tried enabling both ipex and bf16, but got a different error message: "the input and weight need have same data type". Can you share your code so that I can reproduce this issue?

Sure. Please take a look at this repo.

aixideng · Aug 04 '22 04:08