
Installation and run problem

Open JessieW0806 opened this issue 2 years ago • 20 comments

When I run run.sh, it gives the following error:

ImportError: /mnt/cache/wangyingjie/SST/mmdet3d/ops/ball_query/ball_query_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at10TensorBase8data_ptrIfEEPT_v

When installing FSD, I followed https://github.com/tusen-ai/SST/issues/6. My environment is as follows:

sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.2.r11.2/compiler.29618528_0
GCC: gcc (GCC) 5.4.0
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.10.0+cu111
OpenCV: 4.6.0
MMCV: 1.3.9
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0+2028b0c
MMSegmentation: 0.14.1
MMDetection3D: 0.15.0

JessieW0806 avatar Oct 02 '22 05:10 JessieW0806

Looking forward to your reply!

JessieW0806 avatar Oct 02 '22 05:10 JessieW0806

Thanks for using SST. Such an undefined-symbol error is usually caused by incompatible library versions, but I am not sure exactly what went wrong. Here is my log, which contains the relevant environment information; I hope it helps. https://github.com/tusen-ai/SST/files/9689623/sst_waymoD5_1x_3class_8heads_v2.log
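For anyone hitting the same undefined-symbol error, here is a minimal sketch for checking whether the compiled ops match the installed PyTorch (only the module path is taken from the traceback above; everything else is an illustrative assumption):

```python
# Minimal sketch: print the runtime PyTorch/CUDA versions, then try importing the
# extension that failed. An undefined-symbol error at import time usually means the
# ops were compiled against a different PyTorch than the one currently installed,
# and rebuilding mmdet3d/SST against the current environment is the usual fix.
import torch

print('runtime PyTorch:', torch.__version__, '| CUDA:', torch.version.cuda)

try:
    from mmdet3d.ops.ball_query import ball_query_ext  # module path from the traceback above
    print('ball_query_ext imported OK')
except ImportError as err:
    print('import failed (likely built against a different PyTorch):', err)
```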

Abyssaledge avatar Oct 02 '22 09:10 Abyssaledge

Here is the log: fsd_waymoD1_1x.log. I deleted the log you posted because it was too long to include inline. Could you upload it as a file instead?

Abyssaledge avatar Oct 28 '22 07:10 Abyssaledge

Sorry about that. Here is my log: log.txt. The loss values seem to be wrong. Could you please help me out?

JessieW0806 avatar Oct 28 '22 11:10 JessieW0806

https://github.com/tusen-ai/SST/blob/main/configs/fsd/fsd_waymoD1_1x.py#L236 The batch size you use is too large for this value. You should scale this number (the positive-sample limit) along with your batch size (roughly 128 * batch size).
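As a rough illustration of that scaling rule (variable names below are placeholders, not the real key at configs/fsd/fsd_waymoD1_1x.py#L236, and "batch size" is assumed to mean the total batch size):

```python
# Rough sketch of the suggested scaling: keep the positive-sample limit at roughly
# 128 x the total batch size. All names here are placeholders, not actual config keys.
samples_per_gpu = 2
num_gpus = 8
total_batch_size = samples_per_gpu * num_gpus

pos_sample_limit = 128 * total_batch_size  # ~128 positive samples per sample in the batch
print(pos_sample_limit)                    # 2048 for an 8-GPU, 2-samples-per-GPU run
```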

Abyssaledge avatar Oct 28 '22 11:10 Abyssaledge

Thanks for the timely reply! I have changed this setting; however, the loss still seems to be wrong.

2022-10-29 10:11:56,315 - mmdet - INFO - Epoch [1][50/79041] lr: 3.000e-05, eta: 8 days, 17:21:32, time: 0.795, data_time: 0.284, memory: 4435, loss_sem_seg: 0.0157, loss_vote: 0.8871, recall_Car: 1.0000, recall_Ped: 0.9800, recall_Cyc: 1.0000, num_clusters: 136.6600, num_fg_points: 427.8600, loss_cls.task0: 0.0096, loss_center.task0: 0.3746, loss_size.task0: 0.1888, loss_rot.task0: 0.0341, loss_cls.task1: 0.0154, loss_center.task1: 0.0000, loss_size.task1: 0.0000, loss_rot.task1: 0.0000, loss_cls.task2: 0.0169, loss_center.task2: 0.0000, loss_size.task2: 0.0000, loss_rot.task2: 0.0000, loss_rcnn_cls: 0.0435, num_pos_rois: 0.0000, num_neg_rois: 314.8600, loss_rcnn_bbox: 0.0000, loss_rcnn_corner: 0.0000, loss: 1.5857, grad_norm: 15.9475

The full log file log.txt is attached. Looking forward to your reply.

JessieW0806 avatar Oct 29 '22 03:10 JessieW0806

Could you please point out which parts of the config you have modified? It is not easy to check your modifications carefully from the log alone. Also, why do you believe the loss is wrong?

Abyssaledge avatar Oct 30 '22 03:10 Abyssaledge

  1. I did not change any configs; I just want to reproduce the FSD results.
  2. The loss starts at 1 and then goes up to 4. Besides, in epoch 1, items such as loss_rcnn_bbox are 0. Thanks a lot!

JessieW0806 avatar Oct 31 '22 07:10 JessieW0806

  1. According to your log, it seems that you modified at least the number of GPUs, which changes the total batch size and the number of iterations. Since FSD uses SyncBN and an iteration-based warmup, this makes a significant difference in performance, so I suggest checking exactly what you modified.
  2. The loss increase is reasonable because we enable the detection part after 4000 iterations.
  3. The zero losses are also caused by reducing the number of GPUs: 4000 iterations are then not enough for a good segmentation warmup, so no foreground points are selected for the detection part (see the sketch after this list).
  4. I suggest users go through the important parameters in the configs before running experiments, to better handle such issues.
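To make point 3 concrete, here is a minimal sketch of rescaling the warmup length when the total batch size changes; the parameter name below is a placeholder, not the real name in the FSD config:

```python
# Sketch: keep the number of samples seen during the segmentation-only warmup roughly
# constant when the total batch size shrinks. "detection_enable_iter" is a placeholder;
# the actual parameter lives in the FSD config.
reference_total_batch = 8 * 2              # assumed reference: 8 GPUs x 2 samples per GPU
my_total_batch = 4 * 2                     # example: training with only 4 GPUs
scale = reference_total_batch / my_total_batch

detection_enable_iter = int(4000 * scale)  # 4000 is the figure quoted in point 2 above
print(detection_enable_iter)               # 8000 in this example
```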

Abyssaledge avatar Oct 31 '22 11:10 Abyssaledge

Thanks for your help! I have another question. You said "A hotfix is using our code to re-generate the waymo_dbinfo_train.pkl", and I have processed the Waymo data using the latest mmdet3d version. Do I need to re-generate only waymo_dbinfo_train.pkl, rather than re-processing the whole dataset?

JessieW0806 avatar Nov 24 '22 07:11 JessieW0806

You only need to re-generate waymo_dbinfo_train.pkl. If you know the format well, you can simply modify the coordinates in this pickle file instead of regenerating it. FYI, here are our pickles: https://share.weiyun.com/B3Ss4rid. You can compare them with your local data to rule out unexpected bugs.
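If it helps, here is a small sketch for loading a dbinfo pickle and comparing it with the shared one (the paths and key names such as box3d_lidar follow common mmdet3d conventions and are assumptions here):

```python
# Sketch: load two dbinfo pickles (your local one and the shared one) and compare a
# sample entry per class to spot coordinate-convention mismatches. Paths and key
# names are illustrative assumptions.
import pickle

def load_dbinfo(path):
    with open(path, 'rb') as f:
        return pickle.load(f)  # typically a dict: class name -> list of info dicts

mine = load_dbinfo('data/waymo/waymo_dbinfo_train.pkl')    # your locally generated file
theirs = load_dbinfo('downloads/waymo_dbinfo_train.pkl')   # file from the shared link

for cls in sorted(set(mine) & set(theirs)):
    print(cls, len(mine[cls]), len(theirs[cls]))
    if mine[cls] and theirs[cls]:
        print('  local :', mine[cls][0].get('box3d_lidar'))
        print('  shared:', theirs[cls][0].get('box3d_lidar'))
```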

Abyssaledge avatar Nov 24 '22 09:11 Abyssaledge

log.txt I have checked that my processed data is correct, and I have made sure my code matches your update. However, the log is still quite different from your earlier reply (e.g., the strange loss increase). Could you please take a look?

JessieW0806 avatar Nov 24 '22 12:11 JessieW0806

You could try an experiment without the dbsampler; I suspect there is something wrong with it.
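For reference, the usual way to disable it in an mmdet3d-style config is to drop the ObjectSample step from the training pipeline; the pipeline below is a hedged sketch, not the actual FSD pipeline:

```python
# Sketch of a training pipeline with the GT-database sampler disabled by commenting out
# the ObjectSample step. Entries and argument values are illustrative, not the real
# FSD config.
train_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=6, use_dim=5),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True),
    # dict(type='ObjectSample', db_sampler=db_sampler),  # disabled for this ablation
    dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5),
    dict(type='PointsRangeFilter', point_cloud_range=[-80, -80, -2, 80, 80, 4]),
]
```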

Abyssaledge avatar Nov 25 '22 02:11 Abyssaledge

log2.txt I ran an experiment without the dbsampler yesterday; it seems to show the same problem.

JessieW0806 avatar Nov 25 '22 03:11 JessieW0806

Send me an email and I will share the trained checkpoint. You can use it for inference to see whether the results match.

Abyssaledge avatar Nov 25 '22 03:11 Abyssaledge

I have sent it to you.

Abyssaledge avatar Nov 25 '22 05:11 Abyssaledge

res.txt The inference results seem to be fine... so why does training the FSD network from scratch not work correctly? Could you please help me with it?

JessieW0806 avatar Nov 25 '22 07:11 JessieW0806

It's hard to say what is going wrong. Please list the detailed procedure you followed, including data generation, the code you used, and any modifications you made. I will try to help.

Abyssaledge avatar Nov 25 '22 08:11 Abyssaledge

Thanks for your reply! I re-installed the environment and the training process now seems to be fine, but I still have some questions I would like your opinion on.

  1. When I use the original config, the whole Waymo dataset is used. The training takes 4 days, which is much slower than your log. I use the same settings you provided, with 8 A100 GPUs, trained on a cluster: srun -p ai4science --async --job-name=FSD_s --gres=gpu:8 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=8 --kill-on-bad-exit=1 bash run2.sh. What do you think is the reason for the difference in training time?

  2. If I want to increase samples_per_gpu to 8 (for example), what else do I need to change to get the best performance?

  3. I want to use part of the dataset (one-fifth) for future experiments. Is that OK?

JessieW0806 avatar Nov 30 '22 03:11 JessieW0806

Sorry for the late reply. How is it going now?

  1. I don't know; it's hard to say. Check the I/O or profile the timing to find the bottleneck.
  2. Increase the learning rate, and increase the num in IoUNegPiecewiseSampler along with the batch size (see the sketch after this list).
  3. If you do this by setting load_interval, it will be fine.
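Here is a sketch of what points 2 and 3 could look like in config terms (key names follow common mmdet3d conventions; the base values are placeholders, not the real FSD defaults):

```python
# Sketch: adjustments when raising samples_per_gpu and training on one-fifth of the data.
# All base numbers are placeholders; read the actual defaults from the FSD config.
base_samples_per_gpu = 2
new_samples_per_gpu = 8
scale = new_samples_per_gpu / base_samples_per_gpu

data = dict(
    samples_per_gpu=new_samples_per_gpu,
    train=dict(load_interval=5),       # use every 5th training sample, i.e. one-fifth of the data
)

# Linear scaling rule for the learning rate (base_lr is a placeholder).
base_lr = 1e-5
optimizer = dict(type='AdamW', lr=base_lr * scale, weight_decay=0.01)

# Grow the RoI sampler capacity along with the batch size, as suggested in point 2.
roi_sampler = dict(
    type='IoUNegPiecewiseSampler',
    num=int(128 * scale),              # base num of 128 is a placeholder
)
```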

Abyssaledge avatar Dec 09 '22 12:12 Abyssaledge