Error when training and running inference with master-branch code
Hi, I want to try some new features in lightseq, so I compiled from source on the master branch as documented. But when I train and run inference following the example, it doesn't work. During training, something seems to go wrong in ls_transformer.py.
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq_cli/train.py", line 68, in main
    model = task.build_model(args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 327, in build_model
    model = super().build_model(args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 547, in build_model
    model = models.build_model(args, self)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/models/__init__.py", line 58, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 170, in build_model
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 237, in build_decoder
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 374, in __init__
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 374, in <listcomp>
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 409, in build_decoder_layer
ModuleNotFoundError: No module named 'fairseq_user_dir_13687.ls_fs_transformer_decoder_layer'
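For context, the traceback paths suggest fairseq copies the --user-dir modules into a temporary package (fairseq_user_dir_13687 above), so a relative import inside that package fails if the sibling module file is not copied along. Here is a self-contained sketch of that failure mode outside fairseq; the package and directory names below are made up for illustration:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package that mimics the copied user dir
# (all names here are invented; only the import pattern matches the real code).
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "fairseq_user_dir_demo")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "ls_transformer.py"), "w") as f:
    # Same relative import as in ls_transformer.py
    f.write("from .ls_fs_transformer_decoder_layer import LSFSTransformerDecoderLayer\n")

sys.path.insert(0, tmp)
try:
    importlib.import_module("fairseq_user_dir_demo.ls_transformer")
    err = None
except ModuleNotFoundError as e:
    err = e
print(err)
# -> No module named 'fairseq_user_dir_demo.ls_fs_transformer_decoder_layer'

# Once the sibling module exists next to ls_transformer.py, the import works.
with open(os.path.join(pkg, "ls_fs_transformer_decoder_layer.py"), "w") as f:
    f.write("class LSFSTransformerDecoderLayer:\n    pass\n")
importlib.invalidate_caches()  # refresh the finder's directory cache
mod = importlib.import_module("fairseq_user_dir_demo.ls_transformer")
print(mod.LSFSTransformerDecoderLayer.__name__)  # -> LSFSTransformerDecoderLayer
```

So the error indicates that ls_fs_transformer_decoder_layer.py ended up missing from the directory fairseq loads the user modules from.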
I tried to fix the problem and training then works. But when I run inference with the resulting checkpoint, I get BLEU = 0, so lightseq doesn't seem to work.
So here are my questions:
- Is the code on the master branch correct?
- How can I fix the code or the process?
I need to clarify two things first:
- Is the BLEU score during evaluation correct?
- Do you run inference with PyTorch (or do you export the model to LightSeq proto)? If so, you can check what differs between the evaluation and inference pipelines.
You can try running inference after training this way: https://github.com/bytedance/lightseq/tree/master/examples/inference/python
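As an aside, a common cause of BLEU = 0 at inference time while training-time evaluation looks different is a tokenization or detokenization mismatch between the two pipelines. A quick sanity check is to score both sets of outputs against the references with the exact same metric. The sketch below is a minimal plain-Python corpus BLEU for that purpose, not fairseq's or sacrebleu's scorer; it assumes whitespace-tokenized text and one reference per hypothesis:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with uniform n-gram weights and brevity penalty.

    Assumes whitespace-tokenized strings and one reference per hypothesis;
    real scorers (e.g. sacrebleu) also handle tokenization and multiple refs.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_grams, r_grams = _ngrams(h, n), _ngrams(r, n)
            # clipped n-gram matches
            matches[n - 1] += sum(min(c, r_grams[g]) for g, c in h_grams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:  # any zero n-gram count zeroes standard BLEU
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec) * 100.0

# Identical output scores 100; a mismatched pipeline tanks the score.
print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # -> 100.0
```

If the same hypotheses score very differently under evaluation and inference, the discrepancy is almost certainly in pre/post-processing rather than in the model weights.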
Hi, I checked the training log.
- The BLEU score is wrong during evaluation.
- I ran inference following the example.
As I described above, why can't I train based on the master branch directly?
The master branch works fine: https://github.com/bytedance/lightseq/blob/master/examples/training/fairseq/ls_fairseq_wmt14en2de.sh
Can you give us some details about the change you made to fix "No module named 'fairseq_user_dir_13687.ls_fs_transformer_decoder_layer'"?
Hi, thanks for your reply.
I ran the command cp lightseq/training/cli/fs_modules/ls_fs_transformer_decoder_layer.py lightseq/training/ops/pytorch/
and pointed the import at that path:
diff --git a/lightseq/training/cli/fs_modules/ls_transformer.py b/lightseq/training/cli/fs_modules/ls_transformer.py
index a6832ed..015f2fa 100644
--- a/lightseq/training/cli/fs_modules/ls_transformer.py
+++ b/lightseq/training/cli/fs_modules/ls_transformer.py
@@ -406,7 +406,7 @@ class LSTransformerDecoder(FairseqIncrementalDecoder):
                 TransformerDecoderLayer,
             )
         else:
-            from .ls_fs_transformer_decoder_layer import (
+            from lightseq.training.ops.pytorch.ls_fs_transformer_decoder_layer import (
                 LSFSTransformerDecoderLayer as TransformerDecoderLayer,
             )
Oh, I want to make sure our environments are the same. Can you provide a base Docker image? Here is my current environment:
- base Docker image: nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04
- PyTorch 1.8.0 (built from source)
- CMake 3.20 (built from source)
- protobuf and HDF5 installed following https://github.com/bytedance/lightseq/blob/master/docs/inference/build.md
- git clone --recursive https://github.com/bytedance/lightseq.git
With this setup I can run lightseq, but I hit the error above.