tnnandi comments

Results: 18 comments of tnnandi

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

Thanks for the info, @amyeroberts! @parambharat, could you please tell me whether this is a known issue, or am I not using it correctly?

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

Hi @parambharat, please find below the HF trainer code along with the job submission file (please make the required changes based on your environment): test_wandb.py: ``` import datetime import...

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

Hi @parambharat, please let me know whether you could reproduce this issue on your end. Feel free to ask for clarifications!

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

Thanks for the heads up, @parambharat. Your effort is very much appreciated.

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

Hi @tcapelle, thank you for the response. Here's a screenshot of the GPU usage from one of my runs (which uses 2 nodes, with 4 GPUs on each): ![image](https://github.com/huggingface/transformers/assets/12446612/1eb4bc14-483e-4dcf-8398-9ab9f8dd50ee) I'd...

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

@tcapelle Using the following code creates 8 different instances for the job on wandb (note the use of wandb.init()), but each of the instances contains plots from 4 GPUs (each ranked...

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

@tcapelle Here it is https://wandb.ai/tnnandi/test_2node_wandb_project?nw=nwusertnnandi

Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

I just deleted the 3 old runs. You'll find 8 instances now. This is for a 2-node job (each node has 4 GPUs). I'm using DeepSpeed for distributed training.
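
To make the 2-node, 4-GPU layout concrete, here is a minimal sketch of how the 8 processes could each build their own arguments for `wandb.init()` while sharing one group. This is not the code from the runs above. It assumes the launcher exports `RANK` and `LOCAL_RANK` (other launchers may use different variable names), and the helper `wandb_init_kwargs` and the group name `job-1` are hypothetical.

```python
import os


def wandb_init_kwargs(project, group, env=None):
    """Build keyword arguments for wandb.init() for the current process.

    Hypothetical helper. Assumes the launcher exports RANK (global rank)
    and LOCAL_RANK (rank within the node); both default to 0 if unset.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return {
        "project": project,
        "group": group,  # same value for every process of one job
        "name": f"rank{rank}-local{local_rank}",  # one run name per process
    }


# Simulate 2 nodes x 4 GPUs = 8 processes without launching anything.
all_kwargs = [
    wandb_init_kwargs(
        "test_2node_wandb_project",
        "job-1",  # hypothetical group name
        env={"RANK": str(r), "LOCAL_RANK": str(r % 4)},
    )
    for r in range(8)
]
names = [kw["name"] for kw in all_kwargs]
```

In a real job each process would pass its dictionary to `wandb.init(**kwargs)`, so the 8 instances appear under a single group in the project instead of as unrelated runs.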

Number of sequences in BAM header mismatches reference

Hi @walaj, thank you very much for the prompt response! Here's the link to a file that contains the header of the BAM file plus the first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link...

Installation issue

@walaj Thanks for the quick response! Which Makefile should I be using within the SeqLib main directory? git pull shows branch is already up to date. The error message for...