Error when training and running inference based on master code

Open dongdaoking opened this issue 2 years ago • 5 comments

Hi, I want to try some new features in LightSeq, so I followed the instructions for compiling from source on the master branch. But when I train and run inference following the example, it doesn't work. During training, something seems to go wrong in ls_transformer.py.

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq_cli/train.py", line 68, in main
    model = task.build_model(args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 327, in build_model
    model = super().build_model(args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 547, in build_model
    model = models.build_model(args, self)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/models/__init__.py", line 58, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 170, in build_model
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 237, in build_decoder
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 374, in __init__
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 374, in <listcomp>
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 409, in build_decoder_layer
ModuleNotFoundError: No module named 'fairseq_user_dir_13687.ls_fs_transformer_decoder_layer'
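
The traceback points at a relative import inside the temporary copy that fairseq makes of the `--user-dir` directory; if a sibling module is missing from that copy, the relative import fails in exactly this way. Below is a minimal, self-contained reproduction of that failure mode (the `fake_user_dir` package and file names are made up for illustration, they are not part of LightSeq):

```python
import importlib
import os
import sys
import tempfile

def missing_sibling_error():
    """Reproduce the ModuleNotFoundError from a relative import of a missing sibling."""
    tmp = tempfile.mkdtemp()
    pkg = os.path.join(tmp, "fake_user_dir")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    # A module that, like ls_transformer.py, relatively imports a sibling
    # module that was never placed next to it.
    with open(os.path.join(pkg, "ls_transformer.py"), "w") as f:
        f.write("from .ls_fs_transformer_decoder_layer import X\n")
    sys.path.insert(0, tmp)
    try:
        importlib.import_module("fake_user_dir.ls_transformer")
    except ModuleNotFoundError as e:
        return str(e)
    finally:
        sys.path.remove(tmp)

print(missing_sibling_error())
# No module named 'fake_user_dir.ls_fs_transformer_decoder_layer'
```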

I tried to fix the problem and training then works. But when I run inference using the checkpoint, I get BLEU=0, so it seems LightSeq isn't working.

So here are questions:

  1. Is the code on the master branch correct?
  2. How can I fix the code or the process?

dongdaoking avatar Aug 12 '22 03:08 dongdaoking

I need to clarify two questions:

  1. Is the BLEU score during evaluation correct?
  2. Do you run inference using PyTorch (or did you export the model to LightSeq proto)? If so, you can check what differs between evaluation and inference.
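
A BLEU of exactly 0 after seemingly normal training often means the hypotheses and references are tokenized differently (e.g. sub-word output scored against detokenized references) rather than that the model outputs garbage. The toy BLEU below (illustrative only, not sacrebleu) shows how a tokenization mismatch collapses the score:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hyp, ref, max_n=4):
    """Toy sentence-level BLEU: geometric mean of n-gram precisions, no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0
    brevity = math.exp(min(0.0, 1 - len(ref) / len(hyp)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(simple_bleu(ref, ref))                         # 1.0: identical tokenization
print(simple_bleu("th e ca t sat on".split(), ref))  # 0.0: mismatched tokenization
```

So before suspecting the model, it is worth diffing one hypothesis/reference pair from the evaluation log by eye.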

neopro12 avatar Aug 17 '22 03:08 neopro12

https://github.com/bytedance/lightseq/tree/master/examples/inference/python You can try this way to run inference after training.
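
For reference, the Python inference path in that directory loads an exported proto roughly as in the sketch below. This is a hedged sketch, not a verified implementation: the `lsi.Transformer` class name, its batch-size argument, and the `infer` method name are assumptions based on the linked examples, and the import is guarded so the sketch fails cleanly when lightseq is absent.

```python
import importlib.util

def translate_with_lightseq(model_path, input_ids, max_batch_size=8):
    """Sketch: load an exported LightSeq proto and translate batches of token ids."""
    if importlib.util.find_spec("lightseq") is None:
        raise RuntimeError("lightseq is not installed in this environment")
    import lightseq.inference as lsi                 # assumed module layout
    model = lsi.Transformer(model_path, max_batch_size)  # assumed signature
    return model.infer(input_ids)                    # assumed method name
```

Comparing the output ids from this path against `fairseq-generate` on the same inputs is one way to locate where the BLEU=0 discrepancy enters.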

neopro12 avatar Aug 17 '22 03:08 neopro12

Hi, I checked the training log.

  1. The BLEU score is wrong during evaluation.
  2. Inference follows the example.

As I described above, why can't I train directly on the master branch?

dongdaoking avatar Aug 17 '22 06:08 dongdaoking

The master branch works fine: https://github.com/bytedance/lightseq/blob/master/examples/training/fairseq/ls_fairseq_wmt14en2de.sh Can you give us some details about your fix for the "No module named 'fairseq_user_dir_13687.ls_fs_transformer_decoder_layer'" error?

neopro12 avatar Aug 17 '22 13:08 neopro12

Hi, thanks for your reply. I ran the command cp lightseq/training/cli/fs_modules/ls_fs_transformer_decoder_layer.py lightseq/training/ops/pytorch/ and pointed the import at this path:

diff --git a/lightseq/training/cli/fs_modules/ls_transformer.py b/lightseq/training/cli/fs_modules/ls_transformer.py
index a6832ed..015f2fa 100644
--- a/lightseq/training/cli/fs_modules/ls_transformer.py
+++ b/lightseq/training/cli/fs_modules/ls_transformer.py
@@ -406,7 +406,7 @@ class LSTransformerDecoder(FairseqIncrementalDecoder):
                 TransformerDecoderLayer,
             )
         else:
-            from .ls_fs_transformer_decoder_layer import (
+            from lightseq.training.ops.pytorch.ls_fs_transformer_decoder_layer import (
                 LSFSTransformerDecoderLayer as TransformerDecoderLayer,
             )

Oh, I want to make sure our environments are the same. Can you provide a base Docker image? Here is my current environment:

base docker image nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04
pytorch 1.8.0 (compiled from source)
cmake 3.20 (compiled from source)
protobuf and HDF5 installed following https://github.com/bytedance/lightseq/blob/master/docs/inference/build.md
git clone --recursive https://github.com/bytedance/lightseq.git

With this I can run LightSeq, but I hit the error above.
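
Independent of the Docker image, attaching a small version report makes the environment comparison concrete. A stdlib-only sketch (the probed package names are simply the ones mentioned in this thread):

```python
import importlib.metadata
import platform

def environment_report(packages=("torch", "fairseq", "lightseq")):
    """Collect interpreter and installed-package versions for a bug report."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report

print(environment_report())
```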

dongdaoking avatar Aug 17 '22 14:08 dongdaoking