NanoCode012
Hi @andreasfragner and those reading, I got this error when trying to use `jupyterlab_tensorboard` (a JupyterLab frontend for this repo) https://github.com/chaoleili/jupyterlab_tensorboard/issues/25#issue-684139674 . I fixed it by the following. >...
Hi @TimJLS, did you find a solution to this? Actually, you can open the Scalar tab by clicking "Inactive>Scalar", but I am curious why it happens.
```
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
```
Hello @feizhouxiaozhu, I think this may be because you are running multiple trainings at a...
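A minimal sketch of what I mean, assuming two trainings launched on the same machine; the `train.py` flags and device lists here are placeholders, only the distinct `--master_port` values matter:

```
# First training: rendezvous on port 9527
python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --device 0,1

# Second training on the same machine: pick a different free port, otherwise
# the TCPStore fails with "Address already in use"
python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 9528 train.py --device 2,3
```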
Hmm, I'm not sure why that is. @feizhouxiaozhu, could you try to re-clone the repo and try again? If the error still occurs, could you try to run on coco128?...
Hi @cesarandreslopez, nice numbers! The reason GPU 0 has higher memory usage is that it has to communicate with the other GPUs to coordinate them. In my test, however, I don't see...
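If you want to check the per-GPU memory imbalance yourself, one simple way (not part of the original thread, just a standard check) is to poll `nvidia-smi` while training runs:

```
# Print memory used/total for each GPU index, refreshed every second
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```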
@glenn-jocher, I just noticed that! That may be why the memory is so different. But now it's up to optimizations. For a small "total batch size", it makes sense to...
If we do so, testing could take num_gpu times longer. (I remember training/testing with a total batch size of 16 on COCO taking 1 h.) I think giving the user an option...
Hello @liumingjune, could you provide us with the exact line you used? EDIT: Also, did you use the latest repo? I think this could be the reason.
Hi @liumingjune, could you try to `pull` or `clone` the repo again? I saw that your `hyp` values are old, and your `train` function is missing some arguments. I ran...
> well, I got the same problem as @feizhouxiaozhu if I set
> `--nproc_per_node 6 or 8`,
> 2 or 4 is OK.
>
> `python3 -m torch.distributed.launch --master_port...
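For reference, a typical multi-GPU launch for this repo looks roughly like the sketch below; the port, dataset, weights, and device list are placeholders rather than the poster's exact command:

```
# DDP launch with one process per GPU; --master_port must be a free port
python3 -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py \
    --batch-size 64 --data coco128.yaml --weights yolov5s.pt --device 0,1,2,3
```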