tnnandi
Thanks for the info, @amyeroberts! @parambharat, could you please let me know whether this is a known issue, or whether I'm not using it appropriately?
Hi @parambharat, please find below the HF Trainer code along with the job submission file (please make the required changes based on your environment): test_wandb.py: ``` import datetime import...
Hi @parambharat, please let me know if you could reproduce this issue on your end. Feel free to ask for clarifications!
Thanks for the heads up, @parambharat. Your effort is very much appreciated.
Hi @tcapelle, thank you for the response. Here's a screenshot of GPU usage from one of my runs (which uses 2 nodes, with 4 GPUs on each): I'd...
@tcapelle Using the following code creates 8 different instances for the job on wandb (note the use of `wandb.init()`), but each of the instances contains plots from 4 GPUs (each ranked...
@tcapelle Here it is: https://wandb.ai/tnnandi/test_2node_wandb_project?nw=nwusertnnandi
I just deleted the 3 old runs. You'll find 8 instances now. This is for a 2-node job (each node having 4 GPUs). I'm using DeepSpeed for distributed training.
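For reference, here is a minimal sketch of the process layout in this kind of job: 2 nodes with 4 GPUs each gives 8 processes, which matches the 8 instances. It assumes the launcher exports a `RANK` environment variable holding the global rank (torchrun and the DeepSpeed launcher typically do). The helper names `rank_layout` and `is_main_process` are hypothetical and only for illustration; this is not code from the original script.

```python
import os


def rank_layout(num_nodes=2, gpus_per_node=4):
    """Return (global_rank, node_index, local_rank) for every process."""
    return [
        (node * gpus_per_node + local, node, local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]


def is_main_process(env=os.environ):
    """True only on global rank 0. Assumes RANK is set by the launcher;
    falls back to rank 0 when it is missing (single-process run)."""
    return int(env.get("RANK", "0")) == 0


if __name__ == "__main__":
    for global_rank, node, local in rank_layout():
        print(f"global rank {global_rank}: node {node}, local rank {local}")

    if is_main_process():
        # A tracker call such as wandb.init() could be guarded here so that
        # only one process creates a run; whether that is the right fix for
        # this issue is an assumption, not something confirmed above.
        print("this process would create the run")
```

Guarding on the global rank rather than the local rank matters in a multi-node job, since every node has its own local rank 0.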
Hi @walaj, thank you very much for the prompt response! Here's the link to a file that contains the header of the BAM file plus the first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link...
@walaj Thanks for the quick response! Which Makefile should I be using within the SeqLib main directory? `git pull` shows the branch is already up to date. The error message for...