Aaryan YVS comments

Results: 11 comments by Aaryan YVS

Multi-node training on 2 A100 machines.

Yes, I have been able to ssh from one machine to the other using the aliases `ssh genca1001` and `ssh genca1002`. I tried running `sample.py` using the command `deepspeed --hostfile...

Multi-node training on 2 A100 machines.

The initial error I pointed out while running pdsh, `/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory`, was resolved. The error occurred because the file was...

Multi-node training on 2 A100 machines.

No, I just ran the code with the command `NCCL_IB_GID_INDEX=3 NCCL_DEBUG=INFO accelerate launch script.py`. This is the output on the first machine: ``` Genc-A100-VM:36759:36970 [0] include/socket.h:421 NCCL WARN Call to recv...

Multi-node training on 2 A100 machines.

> Is there GPU utilization along with GPU memory usage?

Yes, the GPU utilization is 100%.

> This is weird as there is no error. Do the 2 nodes use...

Multi-node training on 2 A100 machines.

> Yes, I also observed no progress bar being loaded in multi-node setup and hence I resorted to printing loss every `n` steps. When I killed the process on the...

Multi-node training on 2 A100 machines.

> Did you have print statements as I did to debug as I feel wandb metrics aren't being updated. I added a print statement above and below this line ```...

Added --output option

> @Aaryan369 check [this one](https://github.com/openai/whisper/pull/140) for details on verbosity

@bottledmind Thanks for pointing it out. I reverted the changes regarding pbar and verbosity in my code.

Jax/Flax pretraining of wav2vec2

Hi, > For me, it looks like the codevector_perplexity will always collapse to a value of 2 and stay there, which I believe is not a good thing. Also, the contrastive...

Added --output option

Thanks for committing my PR @jongwook and for making the necessary changes! I appreciate your attention to detail and your efforts to improve the project.

unable to start in any mode, lil help?

Not sure if this is what you are looking for, but if you change the inference_mode parameter to huggingface, you wouldn't need to download any models to the local machine....