HKO-7
Trouble with running on GPU
Hi there,
Thanks for making this work publicly available.
I managed to run your code on my own dataset on CPU, but my attempt to run it on GPU hasn't worked yet because of a simple_bind error. For what it's worth, I'm running this on an EC2 GPU instance with the Deep Learning AMI (mxnet_p36 environment, CUDA 10.0). Have you seen this issue before, and do you have any clues?
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1832, in simple_bind
ctypes.byref(exe_handle)))
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: _Map_base::at
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "sst_main.py", line 432, in <module>
train(args)
File "sst_main.py", line 330, in train
factory=sst_nowcasting, context=args.ctx
File "/home/ubuntu/STS-ConvLSTM/nowcasting/encoder_forecaster.py", line 575, in encoder_forecaster_build_networks
shared_module=shared_encoder_net,
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/module.py", line 429, in bind
state_names=self._state_names)
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 280, in __init__
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 384, in bind_exec
shared_group))
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 678, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1838, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (5, 4, 1, 480, 480)
ebrnn1_begin_state_h: (4, 64, 96, 96)
ebrnn2_begin_state_h: (4, 192, 32, 32)
ebrnn3_begin_state_h: (4, 192, 16, 16)
_Map_base::at
Seems to be related to the latest MXNet
@pflashgary Thanks for the question. I haven't run the source code for a while, and the bug seems to be related to MXNet. Which version of MXNet are you currently using?
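One way to answer the version question is to list which MXNet distributions are installed in the active environment. The sketch below is only an illustration: the candidate distribution names are assumptions about how MXNet might have been installed, and `installed_versions` is a hypothetical helper, not part of MXNet. It relies on `importlib.metadata`, which needs Python 3.8 or newer; on an older interpreter such as the Python 3.6 environment in the traceback, `pip list` or `mxnet.__version__` gives the same information.

```python
from importlib import metadata

# Assumed candidate distribution names; adjust to match your install.
CANDIDATES = ("mxnet", "mxnet-cu100", "mxnet-cu100mkl", "mxnet-mkl")


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version} for the names that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the environment; skip it.
            pass
    return found


if __name__ == "__main__":
    versions = installed_versions()
    if versions:
        for name, version in versions.items():
            print(f"{name}=={version}")
    else:
        print("No MXNet distribution found in this environment")
```

Pasting the printed lines together with the CUDA version into the thread makes it easier to compare environments.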
Hi Xingjian, thanks for getting back to me. I'm using the mxnet_p36 environment with CUDA 10.0. Which versions of MXNet and CUDA did you use, so that I can compare?
Hi, does this problem have a solution?
Thanks
@pflashgary and @sulisetyowidodo I currently do not have bandwidth to check which version of MXNet works for the latest CUDA.
We have switched development to PyTorch, and you may want to check out our latest work, Earthformer: https://github.com/amazon-science/earth-forecasting-transformer