Purvang issues

Results: 14 issues of Purvang

Image semantic segmentation example with tf.data object on multi gpu training

Hello, I am training an image semantic segmentation network on multiple GPUs (4 GPUs). Currently my data pipeline uses tf.data.experimental_distribute_dataset from TensorFlow together with a mirrored strategy. I want to...

help wanted

TensorFlow

perf

Different speed test results for different machine configuration

Related to **Model/Framework(s)** *(e.g. GNMT/PyTorch or FasterTransformer/All)* Model: UNet (backbone: VGG16, semantic segmentation) Framework: TensorFlow/PyTorch **Describe the bug** A clear and concise description of what the bug...

bug

How to write a TensorFlow custom training loop using Horovod.

Hi, is there any example that uses Horovod with a custom training loop in TensorFlow, rather than the .fit function, to train on multiple GPUs?
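
For a question like this one, the usual TensorFlow 2 pattern is to wrap `tf.GradientTape` in `hvd.DistributedGradientTape` and to broadcast the initial variables from rank 0 after the first step. The following is only a minimal sketch under stated assumptions, not code from the issue: the toy model, the random data, and the names `scaled_learning_rate` and `main` are illustrative, and details may differ across TensorFlow and Horovod versions.

```python
def scaled_learning_rate(base_lr, world_size):
    """Scale the learning rate linearly with the number of workers (a common convention)."""
    return base_lr * world_size


def main():
    # Imported here so the file can be loaded on a machine without the GPU stack.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin each process to one GPU, chosen by its local rank on the node.
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Toy data (assumption): each worker reads its own shard.
    features = tf.random.normal([256, 8])
    labels = tf.random.uniform([256], maxval=2, dtype=tf.int32)
    dataset = (
        tf.data.Dataset.from_tensor_slices((features, labels))
        .shard(hvd.size(), hvd.rank())
        .batch(32)
    )

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.SGD(scaled_learning_rate(0.01, hvd.size()))

    @tf.function
    def train_step(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        # Average gradients across all workers before applying them.
        tape = hvd.DistributedGradientTape(tape)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    for step, (x, y) in enumerate(dataset):
        loss = train_step(x, y)
        if step == 0:
            # Variables exist only after the first call; sync them from rank 0.
            hvd.broadcast_variables(model.variables, root_rank=0)
        if hvd.rank() == 0:
            print(f"step {step}: loss {float(loss):.4f}")
```

Saved as a script that ends with a call to `main()`, this would typically be launched with something like `horovodrun -np 4 python train.py`. An optimizer that keeps state (momentum, Adam moments) would also need its variables broadcast from rank 0.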

Multi node training question

Hi, I am trying to train YOLOX on 2 nodes, each with 8 GPUs. Both servers can be connected via SSH. After starting the multi-node script, it initializes the GPUs...

Reproduce training result

Hi @datvuthanh @xoiga123, could you please share the parameters and the training strategy you used, so that the result can be reproduced? I am trying to reproduce the result with the suggested dataset from...

8xH100 server training time higher than 8xA100 server.

Related to **Model/Framework(s)** TensorFlow/PyTorch **Describe the bug** While running [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX) on the servers described, total training time on the H100 server is higher than on the A100 server. I also ran a test script on the servers...

bug

No profile directory generated under logs.

I am trying to train on my **H100** and capture a profile, but no profile folder is generated. Code block: ``` callbacks.append(tf.keras.callbacks.TensorBoard(log_dir="checkpoints/logs", histogram_freq=1, profile_batch=(0, 100))) ``` Packages: tf-nightly...

[BUG]: llama2 hybrid_parallel or 3d giving None loss when using pp_size > 1

### 🐛 Describe the bug Hi, I am trying to run the llama2 7B model on the [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct) dataset. As the title suggests, training with the hybrid_parallel or 3d plugin gives a None loss, but...

bug

llama2 training hangs when pp_size > 1

**Describe the bug** I am following the [guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html) to fine-tune the llama2-7B model on 2 nodes (H100). My training hangs at the dataloader sanity check. ``` [NeMo I 2024-05-08 21:28:45 modelPT:724] Optimizer...

bug

No disk space left while loading llama2-70B for SFT

**Describe the bug** https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html Docker image: nvcr.io/nvidia/nemo:24.01.01.framework. I converted the llama2-70B HF model to NeMo format using the doc above; the result is 129GB in size. I have 1.2T of disk space. While running on...

bug

stale
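
The preview of the disk-space report above does not show which filesystem filled up. One way to narrow it down is to compare free space on every directory the job might write to, since a temporary directory can sit on a smaller filesystem than the main data disk. That possibility is an assumption, not something the report confirms. The sketch below uses only the Python standard library; `free_space_report` and the candidate paths are hypothetical names chosen for illustration.

```python
import shutil
import tempfile


def free_space_report(paths, required_bytes):
    """Map each path to (free_bytes, has_enough_space) for its filesystem."""
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        report[path] = (usage.free, usage.free >= required_bytes)
    return report


# 129GB checkpoint size taken from the report (decimal gigabytes assumed).
REQUIRED_BYTES = 129 * 10**9

# Candidate locations (assumptions): the system temp dir and the working dir.
candidates = [tempfile.gettempdir(), "."]

for path, (free, enough) in free_space_report(candidates, REQUIRED_BYTES).items():
    status = "ok" if enough else "too small"
    print(f"{path}: {free / 10**9:.1f} GB free ({status})")
```

Running this inside the container rather than on the host matters, because the two can see different mounts.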