`CUDA_ERROR_ILLEGAL_ADDRESS` issue

Open kmohrman opened this issue 3 months ago • 4 comments

Version of Awkward Array

2.8.8

Description and code to reproduce

As reported here, I have encountered some errors that seem to be related to GPU memory, so I am opening an issue here as requested by @ianna.

Below is the simplest case I've been able to reproduce it in.

wget http://uaf-10.t2.ucsd.edu/~kmohrman/public_html_backup/files/parquet_files/100k_from_lindsey_file/test_pq_100k.parquet
wget http://uaf-10.t2.ucsd.edu/~kmohrman/public_html_backup/files/py_files/fromLindsey/ak_from_cudf.py
import awkward as ak
import cudf
from ak_from_cudf import cudf_to_awkward
from coffea.nanoevents.methods import candidate
from coffea.nanoevents.methods import vector

def test(filepath,nevents):

    table = cudf.read_parquet(filepath, columns = [
        "Electron_pt", "Electron_eta", "Electron_phi", "Electron_mass", "Electron_charge",
        "MET_pt", "MET_phi",
    ])

    Electron_pt     = cudf_to_awkward(table["Electron_pt"])
    Electron_eta    = cudf_to_awkward(table["Electron_eta"])
    Electron_phi    = cudf_to_awkward(table["Electron_phi"])
    Electron_mass   = cudf_to_awkward(table["Electron_mass"])
    Electron_charge = cudf_to_awkward(table["Electron_charge"])
    MET_pt          = cudf_to_awkward(table["MET_pt"])
    MET_phi         = cudf_to_awkward(table["MET_phi"])

    MET = ak.zip(
        {
            "pt": MET_pt,
            "phi": MET_phi,
        },
        with_name="PolarTwoVector",
        behavior=vector.behavior,
    )[:nevents]

    Electron = ak.zip(
        {
            "pt": Electron_pt,
            "eta": Electron_eta,
            "phi": Electron_phi,
            "mass": Electron_mass,
            "charge": Electron_charge,
            "pdgId": -11 * Electron_charge,
        },
        with_name="PtEtaPhiMCandidate",
        behavior=candidate.behavior,
    )[:nevents]

    leptons = Electron

    # Get pairs of leptons
    ll_pairs = ak.combinations(leptons, 2, fields=["l0","l1"])

    # Get distance from Z
    dist_from_z_all_pairs = abs((ll_pairs.l0+ll_pairs.l1).mass - 91.2)

    # Mask out the pairs that are not SFOS (so that we don't include them when finding the one that's closest to Z)
    # And then of the SFOS pairs, get the index of the one that's closest to the Z
    sfos_mask = (ll_pairs.l0.pdgId == -ll_pairs.l1.pdgId)
    dist_from_z_sfos_pairs = ak.mask(dist_from_z_all_pairs,sfos_mask)
    sfos_pair_closest_to_z_idx = ak.argmin(dist_from_z_sfos_pairs,axis=-1,keepdims=True)

    print(f"\nDone with {nevents} events!\n")


def main():

    filepath = "test_pq_100k.parquet"
    nevents=2058
    test(filepath,nevents)

main()

This crashes for 2058 events (on the `sfos_pair_closest_to_z_idx = ...` line), but it runs fine for me for 2057 and below. The errors include `MemoryError: std::bad_alloc: CUDA error (failed to allocate 16 bytes)` and `CUDA_ERROR_ILLEGAL_ADDRESS`. The full error is long, so I've included a screenshot of the beginning and attached the full thing as a text file.
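For anyone who wants to locate a threshold like the 2057/2058 one automatically, here is a minimal bisection sketch. The `fails` predicate is a placeholder, not part of the reproducer: in practice it would have to launch the script in a fresh process for each trial, because after `CUDA_ERROR_ILLEGAL_ADDRESS` the CUDA context in that process is no longer usable. The sketch also assumes that failure is monotonic in the event count, which may not hold for this bug.

```python
from typing import Callable


def first_failing_count(fails: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumptions (not guaranteed for this bug): fails(lo) is False,
    fails(hi) is True, and once a count fails every larger count fails too.
    """
    if fails(lo):
        raise ValueError(f"expected count {lo} to pass")
    if not fails(hi):
        raise ValueError(f"expected count {hi} to fail")

    # Invariant: fails(lo) is False and fails(hi) is True.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Stand-in predicate that mimics the behaviour reported above.
threshold = first_failing_count(lambda n: n >= 2058, 1, 100_000)
```

With the stand-in predicate this needs about 17 trials to search 100k events, which matters when every trial is a separate process that reloads the parquet file.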

[Image: screenshot of the beginning of the error output]

memory_error_sep11_2025.txt

kmohrman avatar Sep 20 '25 16:09 kmohrman

@kmohrman - thanks for opening the issue! I can reproduce it, so it is definitely a bug. What is strange is that it works for the same array created manually. It looks like we either don't get contiguous data or have optimized the parents too aggressively.
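Since the failure depends on how the array was produced, it may help to check results as well as crashes. Below is a plain-Python reference for the selection step, written as a sketch of what I understand `ak.argmin(ak.mask(values, mask), axis=-1, keepdims=True)` to return: one single-element list per event, holding the position of the smallest kept value, or `None` when no value is kept. The function name `argmin_masked_keepdims` is made up for illustration.

```python
from typing import List, Optional, Sequence


def argmin_masked_keepdims(
    values: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
) -> List[List[Optional[int]]]:
    """Per-event index of the smallest value whose mask entry is True.

    Indices refer to positions in the original (unmasked) per-event list,
    and ties resolve to the first occurrence.
    """
    if len(values) != len(mask):
        raise ValueError("values and mask have different numbers of events")

    out: List[List[Optional[int]]] = []
    for event_values, event_mask in zip(values, mask):
        if len(event_values) != len(event_mask):
            raise ValueError("values and mask differ in length within an event")
        best: Optional[int] = None
        for i, (value, keep) in enumerate(zip(event_values, event_mask)):
            if keep and (best is None or value < event_values[best]):
                best = i
        out.append([best])  # keepdims=True: always a length-1 list per event
    return out


reference = argmin_masked_keepdims(
    [[3.0, 1.0, 2.0], [], [5.0, 4.0]],
    [[True, False, True], [], [False, False]],
)
```

The output of this function could be compared against the GPU result after moving it to the CPU and converting it to Python lists, for event counts where the GPU path does not crash.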

ianna avatar Sep 21 '25 18:09 ianna

Hi @ianna, just curious if there were any updates on this? Thanks!

kmohrman avatar Oct 13 '25 13:10 kmohrman

Hi @ianna, just wondering if there are any updates on this?

From my end, I'm not really sure how to debug it. I tried investigating further by running ADL Query 8 (where this issue actually arose in practice), but I cannot isolate the issue to a single event. The error varies depending on how many events, and which sets of events, I process: I have seen errors about two arrays not matching in length, errors about None values, errors complaining about negative dimensions, and the memory error shown in the MRE here.

So, it's not very clear to me how to effectively debug this, sorry.
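One possible way to map which event ranges give which error is to run each window in its own interpreter and record the outcome. The sketch below assumes a hypothetical variant of the reproducer that accepts `start` and `stop` as command-line arguments and slices the events accordingly; the script name and that interface are assumptions, not something in the MRE above. It labels each window with the last non-empty line of stderr, which for a Python traceback is the exception line but may be less informative for errors raised from native code.

```python
import subprocess
import sys
from typing import Dict, Tuple


def run_window(script: str, start: int, stop: int, timeout: float = 600.0) -> str:
    """Run `python script start stop` in a fresh process.

    Returns "ok", "timeout", or the last non-empty stderr line.
    """
    try:
        proc = subprocess.run(
            [sys.executable, script, str(start), str(stop)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    lines = [line.strip() for line in proc.stderr.splitlines() if line.strip()]
    return lines[-1] if lines else f"exit code {proc.returncode}"


def scan_windows(script: str, total: int, size: int) -> Dict[Tuple[int, int], str]:
    """Scan [0, total) in consecutive windows of `size` events."""
    results: Dict[Tuple[int, int], str] = {}
    for start in range(0, total, size):
        stop = min(start + size, total)
        results[(start, stop)] = run_window(script, start, stop)
    return results
```

A call such as `scan_windows("repro_window.py", 100_000, 2_000)` (file name hypothetical) would give one label per window. Because each window runs in a new process, a corrupted CUDA context from one window cannot affect the next.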

kmohrman avatar Dec 02 '25 21:12 kmohrman

Hi @kmohrman. I managed to reproduce it locally. Indeed it is not an easy bug to catch :-)

ianna avatar Dec 02 '25 21:12 ianna