soap117/DeepRule

torch problem

Open tianqibucuo0 opened this issue 1 year ago • 13 comments

My CUDA version is 11.7, but the CUDA version in DeepRule.txt is 8.0. Can I use 11.7 instead?

tianqibucuo0 avatar May 18 '23 03:05 tianqibucuo0

I have added a new environment file (see the updates) and with it I am able to compile the cpools layers.

soap117 avatar May 18 '23 06:05 soap117

thank you very much!

tianqibucuo0 avatar May 18 '23 07:05 tianqibucuo0

Hello, requirement-2023.txt lists 33 packages, but DeepRule.txt lists 96. Do the other packages not need to be installed?

tianqibucuo0 avatar May 18 '23 07:05 tianqibucuo0

Generally not; I have tested it. If you find that something is missing, just install it.

soap117 avatar Jun 16 '23 21:06 soap117

Hello, I am training a model using "linedata(1028)" and encountered two errors. Could you please help me?

1. DeepRule-master/models/py_utils/kp_utils.py:592: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/cuda/Indexing.cu:1239.) tag_full[1-mask_full] = 0
2. python3.9/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead. warnings.warn(warning.format(ret))

Segmentation fault (core dumped)

tianqibucuo0 avatar Jun 21 '23 02:06 tianqibucuo0

For the first one, I think you can use type_as to convert it to torch.float32 before the masked_fill_ command.
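A minimal sketch of that kind of change, using stand-in tensors rather than the real ones from kp_utils.py (the names tag_full / mask_full follow the warning quoted above; note that casting the mask to torch.bool, which is what the warning itself asks for, also removes the message):

```python
import torch

# Stand-ins for the tensors built in kp_utils.py.
tag_full = torch.randn(4, 8)
mask_full = torch.randint(0, 2, (4, 8), dtype=torch.uint8)  # old-style uint8 mask

# Deprecated pattern that triggers the UserWarning:
#   tag_full[1 - mask_full] = 0

# Cast the mask to bool and invert it with ~ instead of 1 - mask:
bool_mask = mask_full.bool()
tag_full[~bool_mask] = 0
# Equivalent in-place form:
# tag_full.masked_fill_(~bool_mask, 0)
```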

soap117 avatar Jun 21 '23 22:06 soap117

Thank you. After fixing all the UserWarning messages, I still encountered "Segmentation fault (core dumped)" during execution. Here is my execution log. Can you please explain why this is happening?

(DeepRule) sun@sun:~/DeepRule-master$ python train_chart.py --cfg_file CornerNetLine --data_dir "/home/sun/data/linedata(1028)" --cache_path "/home/sun/data/linedata(1028)/cache/"
:228: RuntimeWarning: compiletime version 3.6 of module 'pycocotools._mask' does not match runtime version 3.9
:228: RuntimeWarning: builtins.type size changed, may indicate binary incompatibility. Expected 864 from C header, got 880 from PyObject
./config/CornerNetLine.json
['cache', 'line']
loading all datasets...
using 1 threads
loading from cache file: /home/sun/data/linedata(1028)/cache/line_train2019.pkl
loading annotations into memory...
/home/sun/data/linedata(1028)/line/annotations/instancesLine(1023)_train2019.json
Done (t=2.72s)
creating index...
index created!
loading from cache file: /home/sun/data/linedata(1028)/cache/line_val2019.pkl
loading annotations into memory...
/home/sun/data/linedata(1028)/line/annotations/instancesLine(1023)_val2019.json
Done (t=0.05s)
creating index...
index created!
system config...
{'batch_size': 5, 'cache_dir': '/home/sun/yangshaohan/618/data/linedata(1028)/cache/', 'chunk_sizes': [5, 7, 7, 7], 'config_dir': './config', 'data_dir': '/home/sun/yangshaohan/618/data/linedata(1028)', 'data_rng': RandomState(MT19937) at 0x7FE69C7CB340, 'dataset': 'Line', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 50000, 'nnet_rng': RandomState(MT19937) at 0x7FE69C7CB440, 'opt_algo': 'adam', 'prefetch_size': 5, 'pretrain': None, 'result_dir': './results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CornerNetLine', 'stepsize': 45000, 'tar_data_dir': 'cls', 'test_split': 'testchart', 'train_split': 'trainchart', 'val_iter': 100, 'val_split': 'valchart', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 1, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.3, 'gaussian_radius': -1, 'input_size': [511, 511], 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 200, 'weight_exp': 8}
len of db: 116745
building model...
module_file: models.CornerNetLine
use kp
total parameters: 198592138
setting learning rate to: 0.00025
training start...
start prefetching data...
shuffling indices...
['read.txt']
0%| | 0/50000 [00:00<?, ?it/s]
Segmentation fault (core dumped)

tianqibucuo0 avatar Jun 27 '23 07:06 tianqibucuo0

Sounds like the Cornernet package problem. Follow the instructions to compile it.

soap117 avatar Jun 27 '23 17:06 soap117

Hello, after recompiling, the same problem persists. Could you please provide the versions of Python, CUDA, and GCC used with the requirements-2023.txt file? Additionally, I would like to know how much GPU memory is required to train the "line" model.

tianqibucuo0 avatar Jul 01 '23 01:07 tianqibucuo0

Package Version

adal 1.2.7
argcomplete 2.1.2
azure-common 1.1.28
azure-core 1.27.1
azure-graphrbac 0.61.1
azure-mgmt-authorization 3.0.0
azure-mgmt-containerregistry 10.1.0
azure-mgmt-core 1.4.0
azure-mgmt-keyvault 10.2.2
azure-mgmt-resource 22.0.0
azure-mgmt-storage 21.0.0
azureml 0.2.7
azureml-core 1.52.0
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 4.0.1
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 3.1.0
contextlib2 21.6.0
contourpy 1.0.5
cryptography 41.0.1
cycler 0.11.0
docker 6.1.3
fonttools 4.25.0
h5py 3.8.0
humanfriendly 10.0
idna 3.4
importlib-resources 5.2.0
isodate 0.6.1
jeepney 0.8.0
jmespath 1.0.1
jsonpickle 3.0.1
kiwisolver 1.4.4
knack 0.10.1
matplotlib 3.7.1
mkl-fft 1.3.6
mkl-random 1.2.2
mkl-service 2.4.0
msal 1.22.0
msal-extensions 1.0.0
msrest 0.7.1
msrestazure 0.6.4
munkres 1.1.4
ndg-httpsclient 0.5.1
numpy 1.24.3
oauthlib 3.2.2
opencv-python 4.7.0.72
packaging 23.0
pandas 2.0.3
paramiko 3.2.0
pathspec 0.11.1
Pillow 9.4.0
pip 23.0.1
pkginfo 1.9.6
ply 3.11
portalocker 2.7.0
pyasn1 0.5.0
pycparser 2.21
Pygments 2.15.1
PyJWT 2.7.0
PyNaCl 1.5.0
pyOpenSSL 23.2.0
pyparsing 3.0.9
PyQt5-sip 12.11.0
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2023.3
PyYAML 6.0
requests 2.30.0
requests-oauthlib 1.3.1
SecretStorage 3.3.3
setuptools 66.0.0
sip 6.6.2
six 1.16.0
tabulate 0.9.0
toml 0.10.2
torch 1.7.1+cu110
torchaudio 0.7.2
torchvision 0.8.2+cu110
tornado 6.2
typing_extensions 4.5.0
tzdata 2023.3
urllib3 1.26.16
websocket-client 1.6.1
wheel 0.38.4

I am able to run the train code with this environment.
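The pip list above does not show the interpreter or toolkit versions. As a rough sketch for printing them from any environment (it only assumes torch is importable; GCC can be checked separately with gcc --version):

```python
import sys
import torch

print("Python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("CUDA (torch built with):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU visible to torch:", torch.cuda.is_available())
```

With the packages listed here this should report torch 1.7.1 built against CUDA 11.0.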

soap117 avatar Jul 01 '23 03:07 soap117

Thank you for your response. There is no information here about the Python, CUDA, or GCC versions, and the problem could be caused by a version mismatch. Could you please provide that information?

tianqibucuo0 avatar Jul 01 '23 04:07 tianqibucuo0

Hello, my GPU memory is relatively small, so I modified the train.json and val.json files to keep only 10 data entries for testing purposes. However, when execution reaches the line "training = pinned_training_queue.get(block=True)", it gets stuck and does not proceed. Below is my execution log. Can you please tell me the reason for this?

/home/ubuntu/anaconda3/envs/myenv/bin/python /home/ubuntu/download/pycharm-community-2023.1.4/plugins/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 44227 --file /media/ubuntu/A4823F1E823EF480/2023/env/python/DeepRule-master-weixiugai/DeepRule-master/train_chart.py
Connected to pydev debugger (build 231.9225.15)
/home/ubuntu/anaconda3/envs/myenv/lib/python3.6/site-packages/OpenSSL/_util.py:6: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography will remove support for Python 3.6. from cryptography.hazmat.bindings.openssl.binding import Binding
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (pyOpenSSL 23.2.0 (/home/ubuntu/anaconda3/envs/myenv/lib/python3.6/site-packages), Requirement.parse('pyopenssl<23.0.0')).
/media/ubuntu/A4823F1E823EF480/2023/env/python/DeepRule-master-weixiugai/DeepRule-master/train_chart.py:22: FutureWarning: azureml.core: AzureML support for Python 3.6 is deprecated and will be dropped in an upcoming release. At that point, existing Python 3.6 workflows that use AzureML will continue to work without modification, but Python 3.6 users will no longer get access to the latest AzureML features and bugfixes. We recommend that you upgrade to Python 3.7 or newer. To disable SDK V1 deprecation warning set the environment variable AZUREML_DEPRECATE_WARNING to 'False' from azureml.core.run import Run
['line']
loading all datasets...
using 1 threads
loading from cache file: /media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/line_train2019.pkl
loading annotations into memory...
/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/annotations/instancesLine(1023)_train2019.json
Done (t=0.00s)
creating index...
index created!
loading from cache file: /media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/line_val2019.pkl
loading annotations into memory...
/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/annotations/instancesLine(1023)_val2019.json
Done (t=0.00s)
creating index...
index created!
system config...
{'batch_size': 5, 'cache_dir': '/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line', 'chunk_sizes': [5, 7, 7, 7], 'config_dir': './config', 'data_dir': '/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)', 'data_rng': RandomState(MT19937) at 0x7FCC248FF258, 'dataset': 'Line', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.01, 'max_iter': 50000, 'nnet_rng': RandomState(MT19937) at 0x7FCC248FF570, 'opt_algo': 'adam', 'prefetch_size': 5, 'pretrain': None, 'result_dir': './results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CornerNetLine', 'stepsize': 45000, 'tar_data_dir': 'cls', 'test_split': 'testchart', 'train_split': 'trainchart', 'val_iter': 100, 'val_split': 'valchart', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 1, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.3, 'gaussian_radius': -1, 'input_size': [511, 511], 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 200, 'weight_exp': 8}
len of db: 11
building model...
module_file: models.CornerNetLine
use kp
total parameters: 198592138
setting learning rate to: 0.01
training start...
start prefetching data...
['read.txt']
0%| | 0/50000 [00:00<?, ?it/s]
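One common cause of a hang at that line (this is only a generic sketch, not the DeepRule prefetch code itself): the prefetch worker dies before it ever puts a batch on the queue, so a blocking get waits forever. Adding a timeout while debugging turns the silent hang into a visible error:

```python
import queue
import threading

def producer(q: queue.Queue) -> None:
    # Stand-in for a prefetch worker: if it crashes before queuing a batch
    # (bad sample, CUDA error, broken compiled op, ...), nothing ever
    # reaches the consumer.
    raise RuntimeError("sampling failed before any batch was queued")
    q.put("batch")  # never reached

q: queue.Queue = queue.Queue()
threading.Thread(target=producer, args=(q,), daemon=True).start()

# q.get(block=True)  # would hang forever, like the training loop here
try:
    q.get(block=True, timeout=10)  # a timeout surfaces the missing producer
except queue.Empty:
    print("no data arrived - check whether the prefetch worker crashed")
```

If the earlier segmentation fault killed the worker, the main process would block at exactly pinned_training_queue.get(block=True) in the same way.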

tianqibucuo0 avatar Jul 14 '23 08:07 tianqibucuo0

I am currently facing a similar issue. Did you manage to find a solution to this?

LouisPouliot avatar Sep 12 '23 12:09 LouisPouliot