Trouble with running on GPU

Open pflashgary opened this issue 3 years ago • 5 comments

Hi there, thanks for making this work publicly available. I managed to run your code on my own dataset on CPU, but my attempt to run it on GPU fails with a simple_bind error. For what it's worth, I'm running this on an EC2 GPU instance with the Deep Learning AMI (mxnet_p36 environment and CUDA 10.0). Have you seen this issue before, and do you have any clues?

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1832, in simple_bind
    ctypes.byref(exe_handle)))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: _Map_base::at

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sst_main.py", line 432, in <module>
    train(args)
  File "sst_main.py", line 330, in train
    factory=sst_nowcasting, context=args.ctx
  File "/home/ubuntu/STS-ConvLSTM/nowcasting/encoder_forecaster.py", line 575, in encoder_forecaster_build_networks
    shared_module=shared_encoder_net,
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/module.py", line 429, in bind
    state_names=self._state_names)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 280, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 384, in bind_exec
    shared_group))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 678, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1838, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (5, 4, 1, 480, 480)
ebrnn1_begin_state_h: (4, 64, 96, 96)
ebrnn2_begin_state_h: (4, 192, 32, 32)
ebrnn3_begin_state_h: (4, 192, 16, 16)
_Map_base::at

pflashgary • Apr 09 '21 03:04

Seems to be related to the latest MXNet

sxjscience • May 02 '21 20:05

@pflashgary Thanks for the question. I haven't run the source code for a while, and the bug seems to be related to MXNet. Which version of MXNet are you currently using?
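One way to gather the version details for a report like this is sketched below. It is only a minimal example: the helper name `describe_mxnet_env` is hypothetical, and it assumes an MXNet 1.x install where `mx.__version__` and `mx.context.num_gpus()` are available. If MXNet cannot be imported, the corresponding fields are simply left as `None`.

```python
import importlib
import platform


def describe_mxnet_env():
    """Collect version details worth pasting into a bug report.

    Hypothetical helper, not part of HKO-7 or MXNet.
    """
    info = {"python": platform.python_version(), "mxnet": None, "num_gpus": None}
    try:
        mx = importlib.import_module("mxnet")
    except (ImportError, OSError):
        # MXNet is missing, or its native libraries failed to load.
        return info
    info["mxnet"] = getattr(mx, "__version__", None)
    try:
        # Number of GPUs MXNet can see; 0 on a CPU-only build.
        info["num_gpus"] = mx.context.num_gpus()
    except Exception:
        # Older builds or a broken CUDA driver can fail here; keep None.
        pass
    return info


if __name__ == "__main__":
    for key, value in describe_mxnet_env().items():
        print(f"{key}: {value}")
```

Running the script inside the same conda environment that produces the traceback, and pasting its output together with the output of `nvcc --version`, should make it easier to compare setups.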

sxjscience • May 03 '21 16:05

Hi Xingjian, thanks for getting back to me. I'm using the mxnet_p36 environment with CUDA 10.0. Can I ask which versions of MXNet and CUDA you used, so that I can compare?

pflashgary • May 03 '21 21:05

Hi, does this problem have a solution?

Thanks

sulisetyowidodo • Mar 05 '23 14:03

@pflashgary and @sulisetyowidodo, I currently do not have the bandwidth to check which version of MXNet works with the latest CUDA.

We have switched development to PyTorch, and you may want to check out our latest Earthformer paper: https://github.com/amazon-science/earth-forecasting-transformer

sxjscience • Mar 05 '23 16:03