jyc
Current theories that I'm testing on the prod server: 1. Nx is not freeing VRAM, so CUDA runs out of memory and starts showing us these errors (although I think we'd get...
Thanks! I'm not running in IEx but inside of a Phoenix web app—some processes that call Nx live for a long time (hours), but they don't hold references. I've tried...
Thanks! I am indeed currently using the EXLA compiler. If you think the CUDA_ERROR_INVALID_VALUE bug is more likely to be an out-of-memory issue than an Nx bug, is there a...
Thinking out loud— I took a look at how you can examine memory usage in JAX, and it looks like their `heap_profile` function just gets all the live PyArrays, gets...
> I'd try to use whatever XLA provides for memory tracking per client

I don't think XLA provides anything. JAX's `heap_profile` function in the `py_client.cc` code that I linked uses `LiveArrays()`...
I checked to see how XLA's Memory Profile Tool works; it consumes NVIDIA CUDA Tools Profiling Interface (CUPTI) events, turning them into XPlane events (?!) and then reading those events,...
Hm, I just ran into the original bug again. I think manually calling `:erlang.garbage_collect` in the processes that were serving requests helped; I had to remove `Nx.backend_deallocate` because it would...
Hm good idea; I don't know what else could be running, but I restarted my computer and things work now. Sorry for the noise and thanks again for making Rectangle!
I ran into this as well after upgrading macOS. @Joss-Steward's idea of vendoring to disable the warning makes sense. In case it helps others: in my case my `mix.exs` already...
Hm. So for a run that didn't hang I actually see more output before "Acquiring the deploy lock": ``` Run kamal deploy --skip-push --version=hash Pull app image... INFO [e7a736ab] Running...