Jakub Beránek
Measured overhead of HQ so far:
- Barbora/Karolina: 1 ms
- Notebook: 0.1 ms

The overhead on IT4I clusters is caused by old GLIBC and slow `fork` (https://github.com/rust-lang/rust/issues/87764).
I think that this is now superseded by the HQ server event log. @spirali?
Hi :) So there are a few things to unpack here. > Possible that I might have miss-configured, I'm a bit uncertain about the backlog flag, and e.g setting that...
Ok, this definitely sounds suspicious and unintended. I will try to simulate this situation and see if I can fix it.
I implemented a change that should fix the behaviour in this situation. However, we will need to make larger changes to the autoallocator to make it more robust and "smarter",...
Interesting. Is there some properly defined way of finding out which GPUs are actually enabled for the user? Trying to open `/dev/nvidia` seems a bit "fishy". In general, we tried...
Could you please provide some more context? On which cluster does this happen, what is the Nvidia/CUDA configuration used there? Is there e.g. some environment variable that could be used...
Thanks for the details. Using `CUDA_VISIBLE_DEVICES` makes sense to us, since it is also what users are used to, so they could expect that `CUDA_VISIBLE_DEVICES=1,2 hq worker start` will set...
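To illustrate the idea of taking the GPU list from `CUDA_VISIBLE_DEVICES`, here is a minimal Python sketch. It is not HyperQueue's implementation; the function name `visible_gpus` and the choice to return `None` when the variable is unset are assumptions made only for this example.

```python
import os
from typing import List, Mapping, Optional


def visible_gpus(env: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only.
    Returns None when the variable is not set at all, so the caller can
    fall back to some other detection method, and an empty list when it
    is set but empty (no GPUs visible).
    """
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Entries may be indices ("0", "1") or UUIDs ("GPU-..."), so they are
    # kept as strings instead of being converted to integers.
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    print(visible_gpus())
```

With `CUDA_VISIBLE_DEVICES=1,2` set in the environment, the helper yields the two identifiers `"1"` and `"2"`.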
Hi, that looks really weird. Basically, what the autoallocator does is that it first tries to discover the path to the `hq` binary from `/proc/self/exe` so that it knows how...
We will try to detect `(deleted)` in the executable path and provide a better warning to the user. I'll close this issue once it's implemented.
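For context, on Linux the `/proc/self/exe` symlink gets the string ` (deleted)` appended to its target when the running binary has been unlinked (for example, replaced during an upgrade). A minimal sketch of such a check is below, written in Python for brevity. It is not the actual HyperQueue code, and the names `check_executable_path` and `current_executable` are hypothetical.

```python
import os
from typing import Tuple

# Suffix the Linux kernel appends to the symlink target of an unlinked binary.
DELETED_SUFFIX = " (deleted)"


def check_executable_path(path: str) -> Tuple[str, bool]:
    """Split a /proc/<pid>/exe target into (path, was_deleted).

    Hypothetical helper for illustration only.
    """
    if path.endswith(DELETED_SUFFIX):
        return path[: -len(DELETED_SUFFIX)], True
    return path, False


def current_executable() -> Tuple[str, bool]:
    """Resolve the running binary via /proc/self/exe (Linux only)."""
    return check_executable_path(os.readlink("/proc/self/exe"))


if __name__ == "__main__":
    path, deleted = current_executable()
    if deleted:
        print(f"warning: executable {path} was deleted or replaced while running")
    else:
        print(f"executable: {path}")
```

One caveat of this approach: a file whose real name ends in ` (deleted)` would be misreported, so the check is best used to produce a warning rather than a hard error.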