[BUG] Overflow in _C_make_dataobj due to c_int type
Description
When running large 3D grids (e.g., 1295^3 points), Devito fails due to integer overflow in the C structure definition in devito/types/dense.py. The problem originates from the use of ctypes.c_int (a 32-bit signed integer), which overflows when the total matrix size exceeds 2^31 - 1 elements, even with index-mode=int64 and linearize=True enabled.
This results in an incorrect memory-size interpretation and crashes during GPU memory allocation.
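The arithmetic behind this claim can be checked in isolation (a standalone sketch, independent of Devito):

# Standalone check: the *total* element count of a 1295^3 grid exceeds
# INT32_MAX, while each individual per-axis extent fits easily.
n = 1295
total = n ** 3
print(total)               # 2171747375
print(total > 2**31 - 1)   # True: the product overflows a 32-bit signed int
print(n > 2**31 - 1)       # False: a single axis extent does not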
File
devito/types/dense.py
Affected Section
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_int)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_int)),
                                      (_C_field_halo_ofs, POINTER(c_int)),
                                      (_C_field_owned_ofs, POINTER(c_int)),
                                      (_C_field_dmap, c_void_p)]}))
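For context, the snippet above builds the Structure subclass dynamically via type(...) and wraps it in POINTER. A minimal standalone equivalent of the pattern, reduced to two fields for illustration:

from ctypes import POINTER, Structure, c_int, c_void_p

# Dynamically create a ctypes Structure subclass, as dense.py does;
# reduced here to just the 'data' and 'size' fields.
Dataobj = type('dataobj', (Structure,),
               {'_fields_': [('data', c_void_p),
                             ('size', POINTER(c_int))]})
DataobjP = POINTER(Dataobj)   # a pointer-to-struct type, analogous to _C_ctype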
Proposed Fix
Replace c_int with c_long in the _C_ctype struct definition to safely handle arrays larger than 2^31 elements:
from ctypes import c_long
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_long)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_long)),
                                      (_C_field_halo_ofs, POINTER(c_long)),
                                      (_C_field_owned_ofs, POINTER(c_long)),
                                      (_C_field_dmap, c_void_p)]}))
After this modification, all large-domain runs succeed without overflow.
Steps to Reproduce
1. Run a large 3D simulation (> 2^31 elements).
2. Observe the runtime error (a CUDA memory-allocation failure, triggered by an invalid size).
3. Inspect dense.py and note the c_int usage for the size fields.
Observed Behavior
Out of memory allocating 18446744065653020036 bytes of device memory
Failing in Thread:1
Accelerator Fatal Error: call to cuMemAlloc returned error 2 (CUDA_ERROR_OUT_OF_MEMORY)
The reported allocation size is clearly invalid (≈ 1.8×10^19 bytes), caused by a signed 32-bit overflow reinterpreted as an unsigned 64-bit value.
Expected Behavior
Correct allocation for 24.5 GB domain (~2.17×10^9 elements), no overflow, normal execution.
Environment
- Devito version: 4.8.20
- Backend: OpenACC / NVHPC 25.1
- MPI: Enabled
- Hardware: NVIDIA H100 (80 GB HBM)
- OS: Linux (HPC cluster environment)
- Python: 3.10 (NVHPC env)
Fix verified on GPU backends: replacing c_int with c_long fully resolves the issue.
This is not a bug.
- If you run large problems you should use Operator(..., opt=('advanced', {'index-mode': 'int64'})).
- The change you made doesn't impact any allocation or size handling. What you did is force ctypes to use incorrect pointer sizes when calling the library, which will both lead to incorrect results and likely cause page faults.
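A standalone sketch of the size mismatch being described (assuming an LP64 platform; this is not Devito code):

from ctypes import c_int, c_long, sizeof

# The same three extents packed two ways: C code compiled to read
# 'int *' expects 4-byte entries, but a c_long array provides 8-byte
# entries, so the byte layouts disagree.
vals32 = (c_int * 3)(1295, 1295, 1295)
vals64 = (c_long * 3)(1295, 1295, 1295)
print(sizeof(vals32), sizeof(vals64))   # 12 vs 24 bytes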
Hi, thanks for the reply.
I already used index-mode='int64', but the overflow still happened. The issue is not in Python indexing but in the C struct _C_ctype in devito/types/dense.py.
That struct uses c_int (32-bit) for fields like size and hofs. When the grid is larger than 2^31 elements, these values overflow before being passed to the compiled kernel. The result is a wrong memory size (for example 18446744065653020036) and a CUDA error.
After changing c_int to c_long, the same large grid (1295^3, about 24.5 GB) runs correctly on GPU and CPU. No crashes, no page faults.
So even with index-mode=int64, the struct itself still limits integer size to 32 bits. It may be safer to switch those fields to c_long or c_int64 on 64-bit systems.
Again, those _fields_ do not have any impact on any part of the memory management, indexing, or anything like that. They only inform ctypes of the pointer sizes, i.e. the size of the struct, so that ctypes can call the library. Additionally, all the fields you changed to c_long are arrays of int, so unless you are running an N^3 grid where N itself is larger than int32, they will never overflow.
Basically you made a change that does not do what you think it does, and the only reason it "runs" is pure luck. So I would recommend putting together an actual MFE that shows the error rather than trying to hack something in that only breaks the ctypes type matching and pointer sizes.
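A standalone sketch of what that mismatch does in practice, assuming a little-endian LP64 system: C code reading int * from memory that Python filled with 64-bit longs sees interleaved zeros.

from ctypes import POINTER, c_int, c_long, cast

arr64 = (c_long * 3)(1295, 1295, 1295)   # what the patched Python side packs
as_int = cast(arr64, POINTER(c_int))     # what the compiled C side still reads
print([as_int[i] for i in range(3)])     # [1295, 0, 1295] -- garbage, not the extents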
I prepared an MFE to demonstrate that the overflow occurs even with opt=('advanced', {'index-mode': 'int64', 'linearize': True}).
This is not random; the failure happens before kernel execution, during dataobj creation in dense.py.
Minimal Code
import os
import argparse
import numpy as np
from devito import configuration, Eq, TimeFunction, Operator, solve, Grid, Constant
from examples.seismic import PointSource, TimeAxis

parser = argparse.ArgumentParser()
parser.add_argument("-n", type=int, default=100)
args = parser.parse_args()

# Target the GPU backend via OpenACC / NVHPC
os.environ["DEVITO_PLATFORM"] = "nvidiaX"
os.environ["DEVITO_ARCH"] = "nvc"
os.environ["DEVITO_LANGUAGE"] = "openacc"
configuration["platform"] = "nvidiaX"
configuration["language"] = "openacc"
configuration["compiler"] = "nvc"
configuration["profiling"] = "advanced"
configuration["log-level"] = "WARNING"

print("Starting GPU test")

# Unit cube, n^3 grid points, CFL-limited time step
Lx = Ly = Lz = 1
nx = ny = nz = args.n
dx, dy, dz = Lx / (nx - 1), Ly / (ny - 1), Lz / (nz - 1)
CFL = 0.5
V_liq = 1500
t_max = 0.1 * Lx / V_liq
freq = 1e6
dt = CFL / (np.sqrt(1 / dx**2 + 1 / dy**2 + 1 / dz**2 + 1) * V_liq)
time_range = TimeAxis(start=0.0, stop=t_max, step=dt)

grid = Grid(shape=(nx, ny, nz), extent=(Lx, Ly, Lz))
p = TimeFunction(name="p", grid=grid, time_order=2, space_order=2)

# Sinusoidal point source at the domain centre
src = PointSource(name="src", grid=grid, time_range=time_range, npoint=1,
                  coordinates=np.array([[Lx / 2, Ly / 2, Lz / 2]]))
src.data[:, 0] = 1000 * np.sin(2 * np.pi * freq * time_range.time_values)

vp = Constant(name="vp")
vp.data = V_liq

# Acoustic wave equation, solved for the forward time step
eq = Eq((1.0 / vp**2) * p.dt2 - p.laplace, 0)
stencil = solve(eq, p.forward)
step = Eq(p.forward, stencil)
src_term = src.inject(field=p.forward, expr=src)

op = Operator([step] + src_term,
              opt=("advanced", {"gpu-fit": p, "linearize": True, "index-mode": "int64"}))
print(op)

print("Running operator...")
op(time=time_range.num - 1, dt=dt)
print("Success!")
Case 1: Default (c_int)
In devito/types/dense.py:
from ctypes import POINTER, Structure, c_int, c_ulong, c_void_p, cast, byref
...
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_int)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_int)),
                                      (_C_field_halo_ofs, POINTER(c_int)),
                                      (_C_field_owned_ofs, POINTER(c_int)),
                                      (_C_field_dmap, c_void_p)]}))
...
def _C_make_dataobj(self, alias=None, **args):
    ...
    dataobj = byref(self._C_ctype._type_())
    dataobj._obj.data = data.ctypes.data_as(c_restrict_void_p)
    dataobj._obj.size = (c_int*self.ndim)(*data.shape)
    dataobj._obj.nbytes = data.nbytes
    # MPI-related fields
    dataobj._obj.npsize = (c_ulong*self.ndim)(*[i - sum(j) for i, j in
                                                zip(data.shape, self._size_padding)])
    dataobj._obj.dsize = (c_ulong*self.ndim)(*self._size_domain)
    dataobj._obj.hsize = (c_int*(self.ndim*2))(*flatten(self._size_halo))
    dataobj._obj.hofs = (c_int*(self.ndim*2))(*flatten(self._offset_halo))
    dataobj._obj.oofs = (c_int*(self.ndim*2))(*flatten(self._offset_owned))
...
Structure generated at runtime:
struct dataobj
{
  void *__restrict data;
  int * size;
  unsigned long nbytes;
  unsigned long * npsize;
  unsigned long * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
  void * dmap;
};
Run:
python gpu_test.py -n 1295
Output:
Starting GPU test
N^3 = 2.17e+09
Matrices size = 24.50 GB
Running operator...
Out of memory allocating 18446744065653020036 bytes of device memory
Failing in Thread:1
Accelerator Fatal Error: call to cuMemAlloc returned error 2 (CUDA_ERROR_OUT_OF_MEMORY)
This happens before the actual computation: the runtime receives a corrupted size (≈1.84×10^19 bytes).
This number is exactly what you get when a negative value produced by 32-bit integer overflow is reinterpreted as an unsigned 64-bit size in the CUDA memory-allocation call.
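That correspondence can be verified with standalone arithmetic, independent of Devito: the reported figure is exactly a negative 64-bit value viewed as unsigned.

# Two's-complement check of the reported allocation size.
reported = 18446744065653020036
negative = reported - 2**64          # -8056531580: the underlying negative byte count
print(negative % 2**64 == reported)  # True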
Case 2: After Replacing c_int -> c_long
Changed imports and structure in dense.py:
from ctypes import POINTER, Structure, c_long, c_ulong, c_void_p, cast, byref
...
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_long)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_long)),
                                      (_C_field_halo_ofs, POINTER(c_long)),
                                      (_C_field_owned_ofs, POINTER(c_long)),
                                      (_C_field_dmap, c_void_p)]}))
...
def _C_make_dataobj(self, alias=None, **args):
    ...
    dataobj = byref(self._C_ctype._type_())
    dataobj._obj.data = data.ctypes.data_as(c_restrict_void_p)
    dataobj._obj.size = (c_long*self.ndim)(*data.shape)
    dataobj._obj.nbytes = data.nbytes
    # MPI-related fields
    dataobj._obj.npsize = (c_ulong*self.ndim)(*[i - sum(j) for i, j in
                                                zip(data.shape, self._size_padding)])
    dataobj._obj.dsize = (c_ulong*self.ndim)(*self._size_domain)
    dataobj._obj.hsize = (c_long*(self.ndim*2))(*flatten(self._size_halo))
    dataobj._obj.hofs = (c_long*(self.ndim*2))(*flatten(self._offset_halo))
    dataobj._obj.oofs = (c_long*(self.ndim*2))(*flatten(self._offset_owned))
...
Generated struct:
struct dataobj
{
  void *__restrict data;
  long * size;
  unsigned long nbytes;
  unsigned long * npsize;
  unsigned long * dsize;
  long * hsize;
  long * hofs;
  long * oofs;
  void * dmap;
};
Run again:
Starting GPU test
N^3 = 2.17e+09
Matrices size = 24.50 GB
Running operator...
Done | time = 00h 00m 22s | 44267.59 MLUP/s
Success!
Summary
- The overflow occurs even with index-mode='int64'.
- It is triggered inside _C_make_dataobj(), when data.shape (≈2.17×10^9 total elements) is packed into (c_int * ndim).
- The 32-bit c_int field cannot store large shape values (e.g., 1295^3 > 2^31 − 1).
- Switching to c_long resolves the issue reproducibly on both CPU and GPU.
This is not random success; it is a reproducible fix that prevents integer overflow when building the dataobj struct.
Maybe the safest long-term solution is to cast these integer fields dynamically based on the selected index-mode (int32 or int64), as sketched below.
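A hypothetical sketch of that suggestion (the helper below is illustrative only, not Devito's actual API):

from ctypes import c_int32, c_int64

def index_ctype(index_mode):
    # Hypothetical helper: pick the ctypes integer width matching the
    # operator's index-mode; Devito has no such function today.
    return c_int64 if index_mode == 'int64' else c_int32

itype = index_ctype('int64')
sizes = (itype * 3)(1295, 1295, 1295)

Note that, per the objection above, this is only sound if the generated C struct switches width in lockstep: the Python-side and C-side declarations must always agree.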
The issue is probably in the map-to-device clause, which uses u[0:nx*ny*nz]; that should be cast to u[0:(long)nx*(long)ny*(long)nz]. But upcasting all of the parameters is not the way to go; there is no reason to use long for all of those.
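For completeness, a standalone sketch of that mechanism, mimicking the two C integer widths with ctypes (the exact expression Devito generates may differ):

from ctypes import c_int, c_long

# The map-clause length nx*ny*nz wraps if evaluated in 32-bit int;
# evaluated at 64-bit width (the effect of the (long) cast) it is exact.
nx = ny = nz = 1295
print(c_int(nx * ny * nz).value)    # -2123219921: wrapped
print(c_long(nx * ny * nz).value)   # 2171747375: exact on LP64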