NanoCode012
Hi @andreasfragner and those reading, I got this error when trying to use `jupyterlab_tensorboard` (a JupyterLab frontend for this repo) https://github.com/chaoleili/jupyterlab_tensorboard/issues/25#issue-684139674 . I fixed it by the following. >...
Hi @TimJLS, did you find a solution to this? Actually, you can open the Scalar tab by clicking "Inactive>Scalar", but I am curious why it happens.
```
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
```
Hello @feizhouxiaozhu, I think this may be because you are running multiple trainings at a...
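A minimal sketch of what I mean, assuming two trainings launched on the same machine; the `train.py` flags and device lists here are placeholders, only the distinct `--master_port` values matter:

```
# First training: rendezvous on port 9527
python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --device 0,1

# Second training on the same machine: pick a different free port, otherwise
# the TCPStore fails with "Address already in use"
python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 9528 train.py --device 2,3
```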
Hmm, I'm not sure why that is. @feizhouxiaozhu, could you try to re-clone the repo and try again? If the error still occurs, could you try to run on coco128?...
Hi @cesarandreslopez, nice numbers! The reason GPU 0 has higher memory usage is that it has to communicate with the other GPUs to coordinate them. In my test, however, I don't see...
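If you want to check the per-GPU memory imbalance yourself, one simple way (not part of the original thread, just a standard check) is to poll `nvidia-smi` while training runs:

```
# Print memory used/total for each GPU index, refreshed every second
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```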
@glenn-jocher, I just noticed that! That may be why the memory is so different. But now it's up to optimizations. For a small "total batch size", it makes sense to...
If we do so, testing could take num_gpu times longer. (I remember training/testing with a total batch size of 16 on COCO taking 1 h.) I think giving the user an option...
Hello @liumingjune, could you provide us with the exact line you used? EDIT: Also, did you use the latest repo? I think this could be the reason.
Hi @liumingjune, could you try to `pull` or `clone` the repo again? I saw that your `hyp` values are old, and your `train` function is missing some arguments. I ran...
> well, I got the same problem as @feizhouxiaozhu if I set
> `--nproc_per_node 6 or 8`,
> 2 or 4 is OK.
>
> `python3 -m torch.distributed.launch --master_port...
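For reference, a typical multi-GPU launch for this repo looks roughly like the sketch below; the port, dataset, weights, and device list are placeholders rather than the poster's exact command:

```
# DDP launch with one process per GPU; --master_port must be a free port
python3 -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py \
    --batch-size 64 --data coco128.yaml --weights yolov5s.pt --device 0,1,2,3
```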