nimlgen
Current "plan": - [x] map gpu - [x] boot gpu (load fw + power management) - [ ] (not planned anymore) interrupts (was exploring this, but seems a minimal kernel...
- [ ] CUDA requests more `target_sm_config_shared_mem_size` (and the same for the minimum). NV also never uses option 0x5, since for most kernels it was a lot slower (see the sketch after this list).
- [x] Match...
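For context, a minimal sketch of the kind of selection this item is talking about, assuming a QMD-style scheme where the shared-mem config field is an index into a table of carveout sizes. The `SIZE_OPTIONS` table and the `pick_shared_mem_option` helper below are hypothetical placeholders for illustration, not the real hardware mapping or tinygrad's code.

```python
# Hypothetical sketch: pick a shared-mem config option for a kernel.
# SIZE_OPTIONS is a placeholder (option -> KiB), not the real hardware table.
SIZE_OPTIONS = {0x0: 0, 0x1: 8, 0x2: 16, 0x3: 32, 0x4: 64, 0x5: 100, 0x6: 132}

def pick_shared_mem_option(required_kib: int, skip=(0x5,)) -> int:
  # smallest option that fits the kernel's shared-mem requirement, skipping
  # options observed to be slower (per the note above, NV never picks 0x5)
  for opt, kib in sorted(SIZE_OPTIONS.items(), key=lambda kv: kv[1]):
    if opt in skip: continue
    if kib >= required_kib: return opt
  raise ValueError(f"no shared-mem config fits {required_kib} KiB")

# e.g. a kernel needing 48 KiB of shared memory gets option 0x4 (64 KiB)
print(hex(pick_shared_mem_option(48)))
```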
sync: 13.08 ms @ 5.13 GB/s vs sync: 8.98 ms @ 7.47 GB/s
This should be passed in a nicer way somehow, but let's see if that works.
from #7727
`Tensor(numpy).realize()` takes ~0.85 ms to schedule on comma. There are 5 of them, 0.85 ms each, which is why the benchmark is slow after #7593. QCOM copies take:
```
copyin 0.02 ms...
```
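To make that scheduling overhead concrete, here is a minimal timing sketch, not the actual benchmark: it times the `Tensor(numpy).realize()` path end-to-end. The array shapes, iteration count, and use of `time.perf_counter` are assumptions of mine, and the numbers will differ per device.

```python
import time
import numpy as np
from tinygrad import Tensor

# five numpy arrays to mirror the "there are 5 of them" case above (shapes assumed)
data = [np.random.randn(256, 256).astype(np.float32) for _ in range(5)]

st = time.perf_counter()
for arr in data:
  Tensor(arr).realize()  # each call pays its own scheduling cost (~0.85 ms each on comma, per the note above)
et = time.perf_counter()
print(f"5x Tensor(numpy).realize(): {(et - st) * 1e3:.2f} ms")
```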