`CUDA_ERROR_ILLEGAL_ADDRESS` issue
Version of Awkward Array
2.8.8
Description and code to reproduce
As reported here, I have encountered some errors that seem to be related to GPU memory, so I am opening an issue here as requested by @ianna.
Below is the simplest case I've been able to reproduce it in.
wget http://uaf-10.t2.ucsd.edu/~kmohrman/public_html_backup/files/parquet_files/100k_from_lindsey_file/test_pq_100k.parquet
wget http://uaf-10.t2.ucsd.edu/~kmohrman/public_html_backup/files/py_files/fromLindsey/ak_from_cudf.py
import awkward as ak
import cudf
from ak_from_cudf import cudf_to_awkward
from coffea.nanoevents.methods import candidate
from coffea.nanoevents.methods import vector

def test(filepath, nevents):
    table = cudf.read_parquet(filepath, columns=[
        "Electron_pt", "Electron_eta", "Electron_phi", "Electron_mass", "Electron_charge",
        "MET_pt", "MET_phi",
    ])

    Electron_pt = cudf_to_awkward(table["Electron_pt"])
    Electron_eta = cudf_to_awkward(table["Electron_eta"])
    Electron_phi = cudf_to_awkward(table["Electron_phi"])
    Electron_mass = cudf_to_awkward(table["Electron_mass"])
    Electron_charge = cudf_to_awkward(table["Electron_charge"])
    MET_pt = cudf_to_awkward(table["MET_pt"])
    MET_phi = cudf_to_awkward(table["MET_phi"])

    MET = ak.zip(
        {
            "pt": MET_pt,
            "phi": MET_phi,
        },
        with_name="PolarTwoVector",
        behavior=vector.behavior,
    )[:nevents]

    Electron = ak.zip(
        {
            "pt": Electron_pt,
            "eta": Electron_eta,
            "phi": Electron_phi,
            "mass": Electron_mass,
            "charge": Electron_charge,
            "pdgId": -11 * Electron_charge,
        },
        with_name="PtEtaPhiMCandidate",
        behavior=candidate.behavior,
    )[:nevents]

    leptons = Electron

    # Get pairs of leptons
    ll_pairs = ak.combinations(leptons, 2, fields=["l0", "l1"])

    # Get distance from Z
    dist_from_z_all_pairs = abs((ll_pairs.l0 + ll_pairs.l1).mass - 91.2)

    # Mask out the pairs that are not SFOS (so that we don't include them when
    # finding the one that's closest to Z), and then of the SFOS pairs, get the
    # index of the one that's closest to the Z
    sfos_mask = (ll_pairs.l0.pdgId == -ll_pairs.l1.pdgId)
    dist_from_z_sfos_pairs = ak.mask(dist_from_z_all_pairs, sfos_mask)
    sfos_pair_closest_to_z_idx = ak.argmin(dist_from_z_sfos_pairs, axis=-1, keepdims=True)

    print(f"\nDone with {nevents} events!\n")

def main():
    filepath = "test_pq_100k.parquet"
    nevents = 2058
    test(filepath, nevents)

main()
This crashes for 2058 events (on the `sfos_pair_closest_to_z_idx = ...` line), but it runs fine for me for 2057 events and below. The errors look like `MemoryError: std::bad_alloc: CUDA error (failed to allocate 16 bytes)` and `CUDA_ERROR_ILLEGAL_ADDRESS`. The full error is long, so I've included a screenshot of the beginning and attached the full output as a text file.
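One general CUDA debugging technique that may help localize the faulting call (not specific to Awkward Array): kernel launches are asynchronous, so the Python line that raises `CUDA_ERROR_ILLEGAL_ADDRESS` is not necessarily the one that enqueued the faulting kernel. Setting `CUDA_LAUNCH_BLOCKING=1` before anything initializes CUDA forces synchronous launches, so the traceback points at the call that actually faulted:

import os

# Must be set before cudf/awkward/cupy create the CUDA context; every
# kernel launch then blocks until completion, making errors synchronous.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ...then import the GPU stack and run the reproducer as above.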
@kmohrman - thanks for opening the issue! I can reproduce it, so it is definitely a bug. What is strange: it works for the same array created manually. It looks like we either don't get contiguous data or have optimized the parents too aggressively.
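If non-contiguous buffers are the suspect, one way to probe that hypothesis is a sketch like the following, reusing the `Electron` array from the reproducer above. `ak.to_packed` copies an array into fresh, contiguous, minimally-sized buffers, so if `ak.argmin` succeeds on the packed array but fails on the original, the problem likely lies in how the GPU kernels handle offsets into shared or sliced buffers:

import awkward as ak

# Pack the zipped records: every buffer is copied into a contiguous
# layout with no unreachable elements left over from the [:nevents] slice.
leptons = ak.to_packed(Electron)

ll_pairs = ak.combinations(leptons, 2, fields=["l0", "l1"])
dist_from_z_all_pairs = abs((ll_pairs.l0 + ll_pairs.l1).mass - 91.2)
sfos_mask = ll_pairs.l0.pdgId == -ll_pairs.l1.pdgId
dist_from_z_sfos_pairs = ak.mask(dist_from_z_all_pairs, sfos_mask)
sfos_pair_closest_to_z_idx = ak.argmin(dist_from_z_sfos_pairs, axis=-1, keepdims=True)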
Hi @ianna, just curious if there were any updates on this? Thanks!
Hi @ianna, just wondering if there are any updates on this?
From my end, I'm not really sure how to debug it. I tried investigating further by running ADL Query 8 (where this issue actually arose in practice), but I cannot seem to isolate the problem to a single event. The error varies depending on how many (and which sets of) events I process: I've seen errors about the two arrays not matching in length, errors about None values, errors complaining about negative dimensions, and the memory error shown in the MRE here.
So it's not very clear to me how to debug this effectively, sorry.
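One way to narrow it down despite the nondeterminism is to bisect the event count, running each trial in a fresh process so a poisoned CUDA context from one crash cannot contaminate the next trial. This is only a sketch: it assumes the reproducer has been saved as a hypothetical repro.py that takes the event count as its first argument and exits nonzero on failure:

import subprocess
import sys

def crashes(nevents):
    # Run one trial in a fresh interpreter; a nonzero exit code means it crashed.
    result = subprocess.run([sys.executable, "repro.py", str(nevents)])
    return result.returncode != 0

lo, hi = 1, 100_000   # assumed known-good and known-bad event counts
while lo + 1 < hi:    # invariant: lo passes, hi crashes
    mid = (lo + hi) // 2
    if crashes(mid):
        hi = mid
    else:
        lo = mid

print(f"smallest failing event count: {hi}")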
Hi @kmohrman. I managed to reproduce it locally. Indeed it is not an easy bug to catch :-)