
Segmentation fault

Open fengjiaxin opened this issue 5 years ago • 12 comments

Hi, excuse me, I have run into a new issue: when I train the model I get a segmentation fault (core dumped). Could you update the code? I have no idea how to solve the problem.

One more thing: I think GLN/gln/mods/mol_gnn/gnn_family/utils.py could be updated by replacing cuda() with to(DEVICE). Thanks a lot.

fengjiaxin avatar Feb 18 '20 09:02 fengjiaxin
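The cuda() to to(DEVICE) suggestion above can be illustrated with a minimal sketch. It assumes PyTorch is installed; DEVICE and to_device are hypothetical names chosen for illustration and are not taken from the GLN code.

```python
import torch

# Hypothetical module-level device: use CUDA when available, else fall back to CPU.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_device(tensor):
    """Move a tensor to DEVICE instead of calling tensor.cuda() unconditionally."""
    return tensor.to(DEVICE)


x = torch.zeros(3)   # created on the CPU
y = to_device(x)     # lands on the GPU only if one is available
```

Unlike a hard-coded cuda() call, this also runs on machines without a GPU.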

Could you please provide more details about the segfault?

Hanjun-Dai avatar Feb 18 '20 09:02 Hanjun-Dai

./run_mf.sh: line 60: 9301 Segmentation fault (core dumped) python ../main.py -gm $gm -fp_degree 2 -neg_sample $neg_sample -att_type $att_type -gnn_out $gnn_out -tpl_enc $tpl_enc -subg_enc $subg_enc -latent_dim $msg_dim -bn $bn -gen_method $gen -retro_during_train $retro -neg_num $neg_size -embed_dim $embed_dim -readout_agg_type $graph_agg -act_func $act -act_last True -max_lv $lv -dropbox $dropbox -data_name $data_name -save_dir $save_dir -tpl_name $tpl_name -f_atoms $dropbox/cooked_$data_name/atom_list.txt -iters_per_val 3000 -gpu 1 -topk 50 -beam_size 50 -num_parts 1

There is no other information. I don't think it's an environment issue.

fengjiaxin avatar Feb 18 '20 10:02 fengjiaxin
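When a segfault prints nothing beyond "core dumped", one general way to get more detail from a Python process is the standard-library faulthandler module, which dumps the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. This is a generic debugging sketch, not something the GLN scripts are known to do; it would go at the very top of the entry script.

```python
import faulthandler
import sys

# Print the Python traceback of every thread if the process crashes with a
# fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
# sys.__stderr__ is the original stderr stream, which has a real file descriptor.
faulthandler.enable(file=sys.__stderr__, all_threads=True)
```

The same effect is available without editing code by starting the interpreter as python -X faulthandler or by setting the PYTHONFAULTHANDLER environment variable.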

Are you able to run the test with the existing model dumps?

Hanjun-Dai avatar Feb 18 '20 10:02 Hanjun-Dai

And did you modify the script?

I use -gpu 0 in the script. Please try with the vanilla code and see if that works.

Hanjun-Dai avatar Feb 18 '20 10:02 Hanjun-Dai

I get another issue, a GPU CUDA error. Were the ckpt files saved on a GPU?

fengjiaxin avatar Feb 18 '20 10:02 fengjiaxin

I use -gpu 1. Did you save the model on GPU 0? When I run the test script I get the following error:

Traceback (most recent call last):
  File "main_test.py", line 139, in <module>
    model = RetroGLN(cmd_args.dropbox, local_args.model_for_test)
  File "/home/fengjiaxin/GLN/gln/test/model_inference.py", line 43, in __init__
    self.gln.load_state_dict(torch.load(model_file))
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 613, in _load
    result = unpickler.load()
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 576, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 155, in default_restore_location
    result = fn(storage, location)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 135, in _cuda_deserialize
    return storage_type(obj.size())
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/cuda/__init__.py", line 634, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

fengjiaxin avatar Feb 18 '20 11:02 fengjiaxin

Yes, it uses the GPU by default. Please always use -gpu 0 in your script. If you want to change GPU, please use CUDA_VISIBLE_DEVICES instead.

Hanjun-Dai avatar Feb 18 '20 19:02 Hanjun-Dai
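The CUDA_VISIBLE_DEVICES advice above can be sketched as follows. Setting the variable to a physical GPU index hides all other GPUs from the process, so the chosen GPU appears as device 0 and the script can keep -gpu 0. The helper name env_for_physical_gpu is hypothetical, and the launch command in the comment is only an assumption about how the script would be started.

```python
import os


def env_for_physical_gpu(physical_gpu_id, base_env=None):
    """Return a copy of the environment that exposes only one physical GPU.

    Inside a process started with this environment, the selected GPU is
    visible as CUDA device 0.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(physical_gpu_id)
    return env


# Expose physical GPU 1 only; the child process still refers to it as device 0.
env = env_for_physical_gpu(1)

# Assumed launch, not executed here:
#   import subprocess
#   subprocess.run(["bash", "run_mf.sh"], env=env, check=True)
```

From a shell, the equivalent is to prefix the command with CUDA_VISIBLE_DEVICES=1.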

Hi, I debugged the code and there is an error at GLN/gln/graph_logic/soft_logic.py line 29, in jagged_forward: graph_embed = graph_enc(list). There is no other information. Could you briefly introduce your code? I cannot find the error. Thanks.

fengjiaxin avatar Feb 24 '20 07:02 fengjiaxin

Can you provide a Docker image? I think it would be useful.

fengjiaxin avatar Feb 24 '20 09:02 fengjiaxin

graph_enc is from another sub package in this repo.

Can you first try without a GPU? Please take a look at this thread to see how to load a GPU dump on the CPU: https://discuss.pytorch.org/t/on-a-cpu-device-how-to-load-checkpoint-saved-on-gpu-device/349

Hanjun-Dai avatar Feb 25 '20 00:02 Hanjun-Dai
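Loading a GPU dump on the CPU usually comes down to the map_location argument of torch.load. By default, torch.load restores each tensor to the device it was saved from, which is consistent with the out-of-memory error in the traceback above if the dump was saved from GPU 0 and that GPU was busy. The sketch below is self-contained and uses a small stand-in model and a temporary file rather than the real GLN dump; load_state_dict_on_cpu is a hypothetical helper name.

```python
import os
import tempfile

import torch


def load_state_dict_on_cpu(path):
    """Load a checkpoint and remap every tensor to the CPU."""
    return torch.load(path, map_location=torch.device("cpu"))


# Stand-in model; the real code would build the GLN model instead.
model = torch.nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "model.dump")
    torch.save(model.state_dict(), ckpt_path)

    # Works the same way for a checkpoint that was saved from a GPU.
    state = load_state_dict_on_cpu(ckpt_path)
    model.load_state_dict(state)
```

Once loaded this way, the model can be moved to whichever device is actually in use with model.to(...).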

Hi, I debugged the training file and the test file and got the same error, not a CUDA error. Could you briefly introduce your code? Thanks.

fengjiaxin avatar Feb 25 '20 01:02 fengjiaxin

If the error is happening in that line, you may want to double-check https://github.com/Hanjun-Dai/GLN/blob/master/gln/mods/mol_gnn/gnn_family/utils.py#L64

Note that different graph NN implementations will override this function.

Hanjun-Dai avatar Feb 25 '20 20:02 Hanjun-Dai