nimlgen
What GPU do you have? Can you rebase to master and retry?
8m is also pretty slow; gpuocelot takes ~3m on the same tests. I will profile remu, but maybe you, @Qazalin, know what might be slow in remu?
Hmm, do you think multithreading really matters when running CI? It runs a worker on each thread, so I think we should just try to optimize single-threaded performance?
I have seen `index` be slow, I think because of the hashmap. Can we just switch this to a fixed-size array of registers? ``` pub struct VGPR { values:...
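The suggestion above (replace a hashmap lookup with a fixed-size array) can be sketched as follows. This is only an illustration in Python, not remu's Rust code: the class name `FixedVGPR`, the lane count, and the register count are assumptions, since the original struct definition is truncated.

```python
from array import array

NUM_LANES = 32   # assumption: one wave of 32 lanes
NUM_REGS = 256   # assumption: 256 addressable vector registers per lane


class FixedVGPR:
    """Hypothetical register file backed by one flat, preallocated array."""

    def __init__(self):
        # One contiguous block of unsigned ints, zero-initialized up front,
        # so reads and writes never hash a key or allocate.
        self.values = array("I", [0]) * (NUM_LANES * NUM_REGS)

    def read(self, lane, reg):
        # Plain index arithmetic replaces the hashmap lookup.
        return self.values[lane * NUM_REGS + reg]

    def write(self, lane, reg, value):
        # Mask to 32 bits so the value fits the register width.
        self.values[lane * NUM_REGS + reg] = value & 0xFFFFFFFF
```

The trade-off is memory that is always reserved for every register, in exchange for lookups that are a multiply and an add.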
Can you run with gdb and share the backtrace (py-bt)?
```
args_st = self.args_struct_t.from_address(kernargs)
for i in range(len(args)): args_st.__setattr__(f'f{i}', args[i])
```
Is `kernargs` 0 here? Do you have an integrated GPU? You can manage visible GPUs with https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html#rocr-visible-devices if it...
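To illustrate why a zero `kernargs` would matter: `from_address` does not validate the pointer, so the first field write goes straight to that address. Below is a minimal standalone sketch, assuming a struct of 64-bit fields named `f0`, `f1`, ...; the helpers `make_args_struct` and `fill_kernargs` are hypothetical and not part of tinygrad.

```python
import ctypes


def make_args_struct(n):
    """Hypothetical helper: build a struct type with fields f0..f{n-1}."""
    class ArgsStruct(ctypes.Structure):
        _fields_ = [(f"f{i}", ctypes.c_uint64) for i in range(n)]
    return ArgsStruct


def fill_kernargs(args_struct_t, kernargs, args):
    """Write args into the struct located at address `kernargs`."""
    # from_address() accepts any integer, so a null address has to be
    # rejected here; otherwise the first setattr writes to address 0.
    if not kernargs:
        raise ValueError("kernargs is a null address, refusing to write")
    args_st = args_struct_t.from_address(kernargs)
    for i, arg in enumerate(args):
        setattr(args_st, f"f{i}", arg)
    return args_st


# Usage: back the struct with ordinary host memory instead of a GPU mapping.
struct_t = make_args_struct(2)
backing = ctypes.create_string_buffer(ctypes.sizeof(struct_t))
filled = fill_kernargs(struct_t, ctypes.addressof(backing), [0x1234, 7])
```

Printing `hex(kernargs)` just before the loop in the real code would answer the question above directly.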
I am trying to reproduce that one more time as well. If somebody can reproduce that: what are the args passed to `P_set` before the segfault (ptr & value) and the value `kernargs...
notnaton published a trace where the segfault happened in the function `P_set`: `P_set (ptr=0x7ffec0a00000, value=, size=) at ./Modules/_ctypes/cfield.c:1462`
Yeah, NV should be a bit better but it also OOMs. I wrote a custom allocator to fit it better, but it OOMs when creating graphs. Is there any way to remove...
Hmm, need to retest. I recall trying to set `cuMemcpyHtoD_v2` in copyin and it still was an OOM during gpu buffer allocation