MS-SVConv
Training on 3DMatch dataset is not working
My system configuration:
- System: Ubuntu 18.04
- PyTorch 1.9.0 + CUDA 11.1, A100
(torch-points3d-s_H0q_C5-py3.9) (base) torch-points3d$ pip list
Package Version
--------------------------------- ------------
absl-py 1.0.0
addict 2.4.0
aiohttp 3.7.4.post0
alabaster 0.7.12
antlr4-python3-runtime 4.8
anyio 3.5.0
appdirs 1.4.4
argon2-cffi 21.1.0
async-timeout 3.0.1
attrs 21.2.0
autobahn 21.3.1
Automat 20.2.0
Babel 2.9.1
backcall 0.2.0
backports.entry-points-selectable 1.1.0
beautifulsoup4 4.11.1
bleach 4.1.0
cachetools 4.2.2
certifi 2021.5.30
cffi 1.14.6
chardet 4.0.0
charset-normalizer 2.0.6
click 8.1.2
constantly 15.1.0
cryptography 3.4.8
cycler 0.10.0
debugpy 1.4.3
decorator 5.1.0
defusedxml 0.7.1
deprecation 2.1.0
distlib 0.3.3
docker-pycreds 0.4.0
docutils 0.17.1
entrypoints 0.3
fastjsonschema 2.15.3
filelock 3.1.0
gdown 4.4.0
gitdb 4.0.7
GitPython 3.1.27
google-auth 1.35.0
google-auth-oauthlib 0.4.6
googledrivedownloader 0.4
graphql-core 1.1
grpcio 1.44.0
h5py 3.6.0
hydra-core 1.0.5
hyperlink 21.0.0
idna 3.2
imageio 2.18.0
imagesize 1.2.0
importlib-metadata 4.11.3
incremental 21.3.0
install 1.3.5
ipykernel 6.4.1
ipython 7.28.0
ipython-genutils 0.2.0
ipywidgets 7.7.0
isodate 0.6.0
jedi 0.18.0
Jinja2 3.1.1
joblib 1.0.1
json5 0.9.6
jsonpatch 1.32
jsonpointer 2.1
jsonschema 4.4.0
jupyter-client 7.0.3
jupyter-core 4.8.1
jupyter-packaging 0.12.0
jupyter-server 1.16.0
jupyterlab 3.3.4
jupyterlab-pygments 0.1.2
jupyterlab-server 2.13.0
jupyterlab-widgets 1.0.2
kiwisolver 1.3.2
laspy 2.1.2
lazy-object-proxy 1.6.0
llvmlite 0.38.0
Mako 1.2.0
Markdown 3.3.6
MarkupSafe 2.0.1
matplotlib 3.4.3
matplotlib-inline 0.1.3
MinkowskiEngine 0.5.4
mistune 0.8.4
multidict 5.1.0
nbclassic 0.3.7
nbclient 0.5.4
nbconvert 6.5.0
nbformat 5.3.0
nest-asyncio 1.5.1
networkx 2.6.3
ninja 1.10.2.3
notebook 6.4.4
notebook-shim 0.1.0
numba 0.55.1
numpy 1.19.5
oauthlib 3.1.1
omegaconf 2.0.6
open3d 0.15.2
packaging 21.0
pandas 1.4.2
pandocfilters 1.5.0
param 1.11.1
parso 0.8.2
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.3.2
pip 22.0.4
platformdirs 2.4.0
plyfile 0.7.4
prometheus-client 0.11.0
promise 2.3
prompt-toolkit 3.0.20
protobuf 3.20.1
psutil 5.9.0
ptyprocess 0.7.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
pycuda 2020.1
Pygments 2.10.0
pyparsing 2.4.7
pyquaternion 0.9.9
pyrsistent 0.18.0
PySocks 1.7.1
python-dateutil 2.8.2
python-louvain 0.16
pytools 2022.1.4
pytorch-metric-learning 1.3.0
pytz 2021.1
pyvista 0.34.1
PyWavelets 1.3.0
PyYAML 5.4.1
pyzmq 22.3.0
rdflib 6.1.1
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.7.2
scikit-image 0.19.2
scikit-learn 1.0
scipy 1.6.1
scooby 0.5.12
Send2Trash 1.8.0
sentry-sdk 1.5.10
setproctitle 1.2.3
setuptools 62.1.0
shortuuid 1.0.8
six 1.16.0
smmap 4.0.0
sniffio 1.2.0
snowballstemmer 2.1.0
soupsieve 2.3.2.post1
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
terminado 0.12.1
testpath 0.5.0
threadpoolctl 2.2.0
tifffile 2022.4.22
tinycss2 1.1.1
tomlkit 0.10.2
torch 1.9.0+cu111
torch-cluster 1.6.0
torch-geometric 1.7.2
torch-points-kernels 0.7.0
torch-scatter 2.0.9
torch-sparse 0.6.12
torch-spline-conv 1.2.1
torchaudio 0.9.0
torchfile 0.1.0
torchnet 0.0.4
torchsparse 1.4.0
torchvision 0.10.0+cu111
tornado 6.1
tqdm 4.64.0
traitlets 5.1.0
Twisted 21.7.0
txaio 21.2.1
typing-extensions 3.10.0.2
urllib3 1.26.7
visdom 0.1.8.9
vtk 9.1.0
wandb 0.12.15
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.2.1
Werkzeug 2.1.1
wheel 0.37.1
widgetsnbextension 3.6.0
wrapt 1.12.1
wslink 1.0.7
yacs 0.1.8
yapf 0.32.0
yarl 1.6.3
zipp 3.8.0
zope.interface 5.4.0
I launched training with the following command:
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_3head data=registration/fragment3dmatch_sparse training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=200 eval_frequency=10
During training, I found that feat_match_ratio on the val and test sets remains zero even after ~50 epochs. See the following report for more details: https://wandb.ai/ramdrop/registration/reports/-humanpose1-MS-SVConv---VmlldzoxOTE5Mjg1?accessToken=a1b84890nit3x8cacs2aja05u9zglukq9hb616ym39jbav31ekztml4qihed1t19
I am sorry, I forgot to specify this. The model is stuck in a local minimum. To train MS-SVConv with 3 heads, you must first train MS-SVConv with one head and then transfer the weights to the 3-head model. First, train the one-head model:
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_1head data=registration/fragment3dmatch_sparse training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=20 eval_frequency=10
Then train the 3-head model, starting from the one-head weights:
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_3head data=registration/fragment3dmatch_sparse training=sparse_fragment_reg tracker_options.make_submission=True training.wandb.log=True training.batch_size=4 models.path_pretrained="PATH TO THE .pt model of MS-SVConv with one head"
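To illustrate what "transfer the weights to 3 heads" can mean in general, here is a minimal sketch that replicates the parameters of one head across several heads in a state-dict-like mapping. This is not the repository's loading code (the `models.path_pretrained` option above handles that); the function name `expand_single_head_state`, the `unet.` key prefix, and the `unet.<i>.` naming of the multi-head keys are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Mapping


def expand_single_head_state(
    single_head_state: Mapping[str, Any],
    num_heads: int = 3,
    head_prefix: str = "unet.",  # hypothetical key prefix of the single head
) -> Dict[str, Any]:
    """Replicate the weights of one head across `num_heads` heads.

    Keys starting with `head_prefix` are copied to `<head_prefix><i>.<rest>`
    for i in range(num_heads); all other keys are kept unchanged.
    With real tensors you would typically store `value.clone()` so that
    the heads do not share storage.
    """
    expanded: Dict[str, Any] = {}
    for key, value in single_head_state.items():
        if key.startswith(head_prefix):
            suffix = key[len(head_prefix):]
            for i in range(num_heads):
                expanded[f"{head_prefix}{i}.{suffix}"] = value
        else:
            expanded[key] = value
    return expanded


# Toy example: plain lists stand in for parameter tensors.
single = {"unet.conv1.weight": [0.1, 0.2], "final.bias": [0.0]}
multi = expand_single_head_state(single, num_heads=3)
print(sorted(multi))
```

Each head then starts from the same one-head solution instead of a random initialization, which is the idea behind the two-stage training described above.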
No problem. I tried the first command (training MS-SVConv with one head):
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_1head data=registration/fragment3dmatch_sparse training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=20 eval_frequency=10
but the training results still did not make sense after 10 epochs (hit ratio and feature matching ratio remain zero).
For more training details: https://wandb.ai/ramdrop/registration/reports/training-results--VmlldzoxOTI3MzE4?accessToken=vjoy1lurnsnd5a050ym1lj9ht6rumez302ofoiq9ggjttw7fgob6e21cqomz2ivy
I got these training curves: https://wandb.ai/humanpose1/registration/reports/MS-SVConv-3DMatch-1head--VmlldzoxOTI3MzUx?accessToken=pxnuheilrl516fl7xrjfyzrhwp7zvlkpxtrwh18id4k14qr4our6q5h1gbuuu55v
hydra-config.zip: this is the exact conf file (for the training, not for the fragment generation).
For MS-SVConv with 3 heads: https://wandb.ai/humanpose1/registration/reports/MS-SVConv-3-head-3DMatch--VmlldzoxOTI3NDAw?accessToken=y687z8bnv3ch8mxc2yxmmvjy6je4jmvf0xo69hw4ko4z7yi4un0a6ycl5ynbgf2o
Many thanks for your additional information. With your conf file https://github.com/humanpose1/MS-SVConv/issues/21#issuecomment-1114220016, I trained MS-SVConv with one head using the command:
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_1head data=registration/fragment3dmatch_sparse training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=20 eval_frequency=2
The training results show that:
- my performance on the val set makes some sense (good news), but my performance on the test set does not.
- somehow my training performance is lower than yours. (My training performance: https://wandb.ai/ramdrop/registration/reports/MS-SVConv-3DMatch-1head--VmlldzoxOTMxNzMx?accessToken=xucwivw8rc7k8vqem283a8uwxqvei1p0h3e38qmfvd3ws7p6qogy5lyrwb9dqit1 Your training performance: https://wandb.ai/humanpose1/registration/reports/MS-SVConv-3DMatch-1head--VmlldzoxOTI3MzUx?accessToken=pxnuheilrl516fl7xrjfyzrhwp7zvlkpxtrwh18id4k14qr4our6q5h1gbuuu55v)
Since you said the conf file you provided is for the training and not for the fragment generation, could both of the problems above result from the data preprocessing step?