
🐛[BUG]: Failed to allocate memory for requested buffer of size 1851310080

Open melodicdeath opened this issue 1 year ago • 1 comments

Version

source - main

On which installation method(s) does this occur?

Pip

Describe the issue

I ran the 02_model_comparison example:

print("Running Pangu inference")
pangu_ds = inference_ensemble.run_basic_inference(
    pangu_inference_model,
    n=24,  # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
    data_source=pangu_data_source,
    time=time,
)
pangu_ds.to_netcdf(f"{output_dir}/pangu_inference_out.nc")
print(pangu_ds)


RuntimeError                              Traceback (most recent call last)
in <cell line: 2>()
      1 print("Running Pangu inference")
----> 2 pangu_ds = inference_ensemble.run_basic_inference(
      3     pangu_inference_model,
      4     n=24,  # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
      5     data_source=pangu_data_source,

5 frames
/usr/local/lib/python3.10/dist-packages/earth2mip/inference_ensemble.py in run_basic_inference(model, n, data_source, time)
    284     arrays = []
    285     times = []
--> 286     for k, (time, data, _) in enumerate(model(time, x)):
    287         arrays.append(data.cpu().numpy())
    288         times.append(time)

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, time, x, normalize, restart)
    247             dt = torch.tensor(self.time_step.total_seconds())
    248             x1 += self.source(x1, time1) * dt
--> 249             x1 = self.model_6(x1)
    250             yield time1, x1, restart_data
    251

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, x)
    142
    143     def __call__(self, x):
--> 144         return self.forward(x)
    145
    146     def to(self):

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in forward(self, x)
    156         pl = pl.resize(*pl_shape)
    157         sl = surface[0]
--> 158         plo, slo = self.model(pl, sl)
    159         return torch.cat(
    160             [

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, fields_pl, fields_sfc)
    122         output = bind_output("output", like=fields_pl)
    123         output_sfc = bind_output("output_surface", like=fields_sfc)
--> 124         self.ort_session.run_with_iobinding(binding)
    125         return output, output_sfc
    126

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run_with_iobinding(self, iobinding, run_options)
    329         :param run_options: See :class:`onnxruntime.RunOptions`.
    330         """
--> 331         self._sess.run_with_iobinding(iobinding._iobinding, run_options)
    332
    333     def get_tuning_results(self):

RuntimeError: Error in execution: Non-zero status code returned while running BiasSoftmax node. Name:'BiasSoftmax' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080

I don't know what went wrong. However, in the same environment, loading pangu_weather_6.onnx directly and running inference works normally.

Environment details

Kaggle, 2 x Tesla T4 GPUs

!pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P0             26W /   70W |   13623MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
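
For context, the failed allocation can be compared against the nvidia-smi output above. The following is only a rough sketch: it assumes the nvidia-smi snapshot reflects GPU 0 at roughly the time of the failure, and it ignores allocator fragmentation and any memory held in reserve by ONNX Runtime's arena. Under those assumptions, the requested buffer is slightly larger than the memory still free on GPU 0.

```python
MIB = 1024 ** 2

requested_bytes = 1_851_310_080  # buffer size from the BFCArena error message
total_mib = 15_360               # GPU 0 capacity reported by nvidia-smi
used_mib = 13_623                # GPU 0 usage reported by nvidia-smi

requested_mib = requested_bytes / MIB
free_mib = total_mib - used_mib            # assumes the snapshot matches the failure
shortfall_mib = requested_mib - free_mib   # positive means the request cannot fit

print(f"requested: {requested_mib:.1f} MiB")  # requested: 1765.5 MiB
print(f"free:      {free_mib} MiB")           # free:      1737 MiB
print(f"shortfall: {shortfall_mib:.1f} MiB")  # shortfall: 28.5 MiB
```

The second T4 is idle (3 MiB in use), so the memory pressure is confined to GPU 0.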

melodicdeath, Jul 15 '24 14:07

Sorry, it's not a bug.

  1. Install the optional dependencies for Pangu weather: $ pip install .[pangu]
  2. Change n from 24 to 12.
  3. Load only pangu_weather_6.onnx: pangu.load_6(package)

That resolved it.
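
The steps above might look roughly like the sketch below. Only pangu.load_6(package), run_basic_inference, and n=12 come from this thread. The helper name run_pangu_6h_only, the "e2mip://pangu" registry path, the cds.DataSource call, the start date, and the output directory are assumptions modeled on the 02_model_comparison example and may differ between earth2mip versions. Imports are deferred into the function so the snippet can be defined without earth2mip installed.

```python
import datetime


def run_pangu_6h_only(n_steps=12, output_dir="outputs"):
    """Hypothetical helper: run Pangu with only the 6-hour ONNX model loaded."""
    # Deferred imports: these require `pip install .[pangu]` (step 1).
    from earth2mip import inference_ensemble, registry
    from earth2mip.initial_conditions import cds
    from earth2mip.networks import pangu

    # Assumed registry path, as used in the 02_model_comparison example.
    package = registry.get_model("e2mip://pangu")

    # Step 3: load only pangu_weather_6.onnx rather than both the 6- and 24-hour models.
    model = pangu.load_6(package)

    # Assumed data source and start time, following the example notebook.
    data_source = cds.DataSource(model.in_channel_names)
    time = datetime.datetime(2018, 1, 1)

    # Step 2: 12 steps at a 6-hour dt gives a 3-day forecast.
    ds = inference_ensemble.run_basic_inference(
        model,
        n=n_steps,
        data_source=data_source,
        time=time,
    )
    ds.to_netcdf(f"{output_dir}/pangu_inference_out.nc")
    return ds
```

Calling run_pangu_6h_only() would then replace the failing cell. Whether loading a single ONNX model frees enough GPU memory on a T4 depends on what else is resident on the device.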

melodicdeath, Jul 17 '24 03:07

Thanks for the update. I'll close this then.

nbren12, Oct 07 '24 15:10