[BUG] Overflow in _C_make_dataobj due to c_int type
Description
When running large 3D grids (e.g., 1295^3 points), Devito fails due to integer overflow in the C structure definition in devito/types/dense.py. The problem originates from the use of ctypes.c_int (a 32-bit signed integer), which overflows when the total matrix size exceeds 2^31 - 1 elements, even with index-mode=int64 and linearize=True enabled.
This results in an incorrect memory-size interpretation and crashes during GPU memory allocation.
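The arithmetic behind this claim can be checked in isolation (a standalone sketch, independent of Devito):

# Standalone check: the *total* element count of a 1295^3 grid exceeds
# INT32_MAX, while each individual per-axis extent fits easily.
n = 1295
total = n ** 3
print(total)               # 2171747375
print(total > 2**31 - 1)   # True: the product overflows a 32-bit signed int
print(n > 2**31 - 1)       # False: a single axis extent does not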
File
devito/types/dense.py
Affected Section
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_int)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_int)),
                                      (_C_field_halo_ofs, POINTER(c_int)),
                                      (_C_field_owned_ofs, POINTER(c_int)),
                                      (_C_field_dmap, c_void_p)]}))
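For context, the snippet above builds the Structure subclass dynamically via type(...) and wraps it in POINTER. A minimal standalone equivalent of the pattern, reduced to two fields for illustration:

from ctypes import POINTER, Structure, c_int, c_void_p

# Dynamically create a ctypes Structure subclass, as dense.py does;
# reduced here to just the 'data' and 'size' fields.
Dataobj = type('dataobj', (Structure,),
               {'_fields_': [('data', c_void_p),
                             ('size', POINTER(c_int))]})
DataobjP = POINTER(Dataobj)   # a pointer-to-struct type, analogous to _C_ctype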
Proposed Fix
Replace c_int with c_long in the _C_ctype struct definition to safely handle arrays larger than 2^31 elements:
from ctypes import c_long
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_long)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_long)),
                                      (_C_field_halo_ofs, POINTER(c_long)),
                                      (_C_field_owned_ofs, POINTER(c_long)),
                                      (_C_field_dmap, c_void_p)]}))
After this modification, all large-domain runs succeed without overflow.
Steps to Reproduce
1. Run a large 3D simulation (> 2^31 elements).
2. Observe the runtime error (a CUDA memory-allocation failure, triggered by an invalid size).
3. Inspect dense.py and note the c_int usage for the size fields.
Observed Behavior
Out of memory allocating 18446744065653020036 bytes of device memory
Failing in Thread:1
Accelerator Fatal Error: call to cuMemAlloc returned error 2 (CUDA_ERROR_OUT_OF_MEMORY)
The reported allocation size is clearly invalid (≈ 1.8×10^19 bytes), caused by a signed 32-bit overflow reinterpreted as an unsigned 64-bit value.
Expected Behavior
Correct allocation for 24.5 GB domain (~2.17×10^9 elements), no overflow, normal execution.
Environment
- Devito version: 4.8.20
- Backend: OpenACC / NVHPC 25.1
- MPI: Enabled
- Hardware: NVIDIA H100 (80 GB HBM)
- OS: Linux (HPC cluster environment)
- Python: 3.10 (NVHPC env)
Fix verified on GPU backends: replacing c_int with c_long fully resolves the issue.
This is not a bug.
- If you run large problems you should use Operator(..., opt=('advanced', {'index-mode': 'int64'})).
- The change you made doesn't impact any allocation or size handling. What you did is force ctypes to use incorrect pointer sizes when calling the library, which will both lead to incorrect results and likely cause page faults.
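A standalone sketch of the size mismatch being described (assuming an LP64 platform; this is not Devito code):

from ctypes import c_int, c_long, sizeof

# The same three extents packed two ways: C code compiled to read
# 'int *' expects 4-byte entries, but a c_long array provides 8-byte
# entries, so the byte layouts disagree.
vals32 = (c_int * 3)(1295, 1295, 1295)
vals64 = (c_long * 3)(1295, 1295, 1295)
print(sizeof(vals32), sizeof(vals64))   # 12 vs 24 bytes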
Hi, thanks for the reply.
I already used index-mode='int64', but the overflow still happened. The issue is not in Python indexing but in the C struct _C_ctype in devito/types/dense.py.
That struct uses c_int (32-bit) for fields like size and hofs. When the grid is larger than 2^31 elements, these values overflow before being passed to the compiled kernel. The result is a wrong memory size (for example 18446744065653020036) and a CUDA error.
After changing c_int to c_long, the same large grid (1295^3, about 24.5 GB) runs correctly on GPU and CPU. No crashes, no page faults.
So even with index-mode=int64, the struct itself still limits integer size to 32 bits. It may be safer to switch those fields to c_long or c_int64 on 64-bit systems.
Again, those _fields_ do not have any impact on any part of the memory management, indexing, or anything like that. They only inform ctypes of the pointer sizes, i.e. the size of the struct, so that ctypes can call the library. Additionally, all the fields you changed to c_long are arrays of int, so unless you are running an N^3 grid where N itself is larger than int32, they will never overflow.
Basically you made a change that does not do what you think it does, and the only reason it "runs" is pure luck. So I would recommend putting together an actual MFE that shows the error rather than trying to hack something in that only breaks the ctypes type matching and pointer sizes.
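A standalone sketch of what that mismatch does in practice, assuming a little-endian LP64 system: C code reading int * from memory that Python filled with 64-bit longs sees interleaved zeros.

from ctypes import POINTER, c_int, c_long, cast

arr64 = (c_long * 3)(1295, 1295, 1295)   # what the patched Python side packs
as_int = cast(arr64, POINTER(c_int))     # what the compiled C side still reads
print([as_int[i] for i in range(3)])     # [1295, 0, 1295] -- garbage, not the extents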
I prepared an MFE to demonstrate that the overflow occurs even with opt=('advanced', {'index-mode': 'int64', 'linearize': True}).
This is not random; the failure happens before kernel execution, during dataobj creation in dense.py.
Minimal Code
import os
import argparse
import numpy as np
from devito import configuration, Eq, TimeFunction, Operator, solve, Grid, Constant
from examples.seismic import PointSource, TimeAxis

parser = argparse.ArgumentParser()
parser.add_argument("-n", type=int, default=100)
args = parser.parse_args()

# Target the GPU backend via OpenACC / NVHPC
os.environ["DEVITO_PLATFORM"] = "nvidiaX"
os.environ["DEVITO_ARCH"] = "nvc"
os.environ["DEVITO_LANGUAGE"] = "openacc"
configuration["platform"] = "nvidiaX"
configuration["language"] = "openacc"
configuration["compiler"] = "nvc"
configuration["profiling"] = "advanced"
configuration["log-level"] = "WARNING"

print("Starting GPU test")

# Unit cube, n^3 grid points, CFL-limited time step
Lx = Ly = Lz = 1
nx = ny = nz = args.n
dx, dy, dz = Lx / (nx - 1), Ly / (ny - 1), Lz / (nz - 1)
CFL = 0.5
V_liq = 1500
t_max = 0.1 * Lx / V_liq
freq = 1e6
dt = CFL / (np.sqrt(1 / dx**2 + 1 / dy**2 + 1 / dz**2 + 1) * V_liq)
time_range = TimeAxis(start=0.0, stop=t_max, step=dt)

grid = Grid(shape=(nx, ny, nz), extent=(Lx, Ly, Lz))
p = TimeFunction(name="p", grid=grid, time_order=2, space_order=2)

# Sinusoidal point source at the domain centre
src = PointSource(name="src", grid=grid, time_range=time_range, npoint=1,
                  coordinates=np.array([[Lx / 2, Ly / 2, Lz / 2]]))
src.data[:, 0] = 1000 * np.sin(2 * np.pi * freq * time_range.time_values)

vp = Constant(name="vp")
vp.data = V_liq

# Acoustic wave equation, solved for the forward time step
eq = Eq((1.0 / vp**2) * p.dt2 - p.laplace, 0)
stencil = solve(eq, p.forward)
step = Eq(p.forward, stencil)
src_term = src.inject(field=p.forward, expr=src)

op = Operator([step] + src_term,
              opt=("advanced", {"gpu-fit": p, "linearize": True, "index-mode": "int64"}))
print(op)

print("Running operator...")
op(time=time_range.num - 1, dt=dt)
print("Success!")
Case 1: Default (c_int)
In devito/types/dense.py:
from ctypes import POINTER, Structure, c_int, c_ulong, c_void_p, cast, byref
...
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_int)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_int)),
                                      (_C_field_halo_ofs, POINTER(c_int)),
                                      (_C_field_owned_ofs, POINTER(c_int)),
                                      (_C_field_dmap, c_void_p)]}))
...
def _C_make_dataobj(self, alias=None, **args):
    ...
    dataobj = byref(self._C_ctype._type_())
    dataobj._obj.data = data.ctypes.data_as(c_restrict_void_p)
    dataobj._obj.size = (c_int*self.ndim)(*data.shape)
    dataobj._obj.nbytes = data.nbytes
    # MPI-related fields
    dataobj._obj.npsize = (c_ulong*self.ndim)(*[i - sum(j) for i, j in
                                                zip(data.shape, self._size_padding)])
    dataobj._obj.dsize = (c_ulong*self.ndim)(*self._size_domain)
    dataobj._obj.hsize = (c_int*(self.ndim*2))(*flatten(self._size_halo))
    dataobj._obj.hofs = (c_int*(self.ndim*2))(*flatten(self._offset_halo))
    dataobj._obj.oofs = (c_int*(self.ndim*2))(*flatten(self._offset_owned))
...
Structure generated at runtime:
struct dataobj
{
  void *__restrict data;
  int * size;
  unsigned long nbytes;
  unsigned long * npsize;
  unsigned long * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
  void * dmap;
};
Run:
python gpu_test.py -n 1295
Output:
Starting GPU test
N^3 = 2.17e+09
Matrices size = 24.50 GB
Running operator...
Out of memory allocating 18446744065653020036 bytes of device memory
Failing in Thread:1
Accelerator Fatal Error: call to cuMemAlloc returned error 2 (CUDA_ERROR_OUT_OF_MEMORY)
This happens before the actual computation: the runtime receives a corrupted size (≈1.84×10^19 bytes).
This number is exactly what you get when a negative value produced by 32-bit integer overflow is reinterpreted as an unsigned 64-bit size in the CUDA memory-allocation call.
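That correspondence can be verified with standalone arithmetic, independent of Devito: the reported figure is exactly a negative 64-bit value viewed as unsigned.

# Two's-complement check of the reported allocation size.
reported = 18446744065653020036
negative = reported - 2**64          # -8056531580: the underlying negative byte count
print(negative % 2**64 == reported)  # True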
Case 2: After Replacing c_int -> c_long
Changed imports and structure in dense.py:
from ctypes import POINTER, Structure, c_long, c_ulong, c_void_p, cast, byref
...
_C_ctype = POINTER(type(_C_structname, (Structure,),
                        {'_fields_': [(_C_field_data, c_restrict_void_p),
                                      (_C_field_size, POINTER(c_long)),
                                      (_C_field_nbytes, c_ulong),
                                      (_C_field_nopad_size, POINTER(c_ulong)),
                                      (_C_field_domain_size, POINTER(c_ulong)),
                                      (_C_field_halo_size, POINTER(c_long)),
                                      (_C_field_halo_ofs, POINTER(c_long)),
                                      (_C_field_owned_ofs, POINTER(c_long)),
                                      (_C_field_dmap, c_void_p)]}))
...
def _C_make_dataobj(self, alias=None, **args):
    ...
    dataobj = byref(self._C_ctype._type_())
    dataobj._obj.data = data.ctypes.data_as(c_restrict_void_p)
    dataobj._obj.size = (c_long*self.ndim)(*data.shape)
    dataobj._obj.nbytes = data.nbytes
    # MPI-related fields
    dataobj._obj.npsize = (c_ulong*self.ndim)(*[i - sum(j) for i, j in
                                                zip(data.shape, self._size_padding)])
    dataobj._obj.dsize = (c_ulong*self.ndim)(*self._size_domain)
    dataobj._obj.hsize = (c_long*(self.ndim*2))(*flatten(self._size_halo))
    dataobj._obj.hofs = (c_long*(self.ndim*2))(*flatten(self._offset_halo))
    dataobj._obj.oofs = (c_long*(self.ndim*2))(*flatten(self._offset_owned))
...
Generated struct:
struct dataobj
{
  void *__restrict data;
  long * size;
  unsigned long nbytes;
  unsigned long * npsize;
  unsigned long * dsize;
  long * hsize;
  long * hofs;
  long * oofs;
  void * dmap;
};
Run again:
Starting GPU test
N^3 = 2.17e+09
Matrices size = 24.50 GB
Running operator...
Done | time = 00h 00m 22s | 44267.59 MLUP/s
Success!
Summary
- The overflow occurs even with index-mode='int64'.
- It is triggered inside _C_make_dataobj(), when data.shape (≈2.17×10^9 total elements) is packed into (c_int * ndim).
- The 32-bit c_int field cannot store large shape values (e.g., 1295^3 > 2^31 − 1).
- Switching to c_long resolves the issue reproducibly on both CPU and GPU.
This is not random success; it is a reproducible fix that prevents integer overflow when building the dataobj struct.
Maybe the safest long-term solution is to cast these integer fields dynamically based on the selected index-mode (int32 or int64), as sketched below.
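A hypothetical sketch of that suggestion (the helper below is illustrative only, not Devito's actual API):

from ctypes import c_int32, c_int64

def index_ctype(index_mode):
    # Hypothetical helper: pick the ctypes integer width matching the
    # operator's index-mode; Devito has no such function today.
    return c_int64 if index_mode == 'int64' else c_int32

itype = index_ctype('int64')
sizes = (itype * 3)(1295, 1295, 1295)

Note that, per the objection above, this is only sound if the generated C struct switches width in lockstep: the Python-side and C-side declarations must always agree.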
The issue is probably in the map-to-device clause, which uses u[0:nx*ny*nz]; that should be cast to u[0:(long)nx*(long)ny*(long)nz]. But upcasting all of the parameters is not the way to go; there is no reason to use long for all of those.
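For completeness, a standalone sketch of that mechanism, mimicking the two C integer widths with ctypes (the exact expression Devito generates may differ):

from ctypes import c_int, c_long

# The map-clause length nx*ny*nz wraps if evaluated in 32-bit int;
# evaluated at 64-bit width (the effect of the (long) cast) it is exact.
nx = ny = nz = 1295
print(c_int(nx * ny * nz).value)    # -2123219921: wrapped
print(c_long(nx * ny * nz).value)   # 2171747375: exact on LP64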