Purvang

Results 14 issues of Purvang

Hello, I am training an image semantic segmentation network on multiple GPUs (4 GPUs). Currently my data pipeline uses tf.data.experimental_distribute_dataset from TensorFlow together with a mirrored strategy. I want to...

help wanted
TensorFlow
perf

Related to **Model/Framework(s)** *(e.g. GNMT/PyTorch or FasterTransformer/All)* Model: UNet (backbone: VGG16) (Semantic Segmentation) Framework: TensorFlow/PyTorch **Describe the bug** A clear and concise description of what the bug...

bug

Hi, is there an example that uses Horovod with a custom training loop in TensorFlow, rather than the .fit function, to train on multiple GPUs?

Hi, I am trying to train YOLOX on 2 nodes, each with 8 GPUs. Both servers can connect to each other over SSH. After starting the multi-node script, it initializes the GPUs...

Hi @datvuthanh @xoiga123, could you please share the parameters and training strategy you used, so that the result can be reproduced? I am trying to reproduce the result with the suggested dataset from...

Related to **Model/Framework(s)** TensorFlow/PyTorch **Describe the bug** While running [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX) on the servers described, total training time on the H100 server is higher than on the A100 server. I also ran a test script on the servers...

bug

I am trying to train on my **H100** and capture a profile, but no profile folder is generated. Code block: ``` callbacks.append(tf.keras.callbacks.TensorBoard(log_dir="checkpoints/logs", histogram_freq=1, profile_batch=(0, 100))) ``` Packages: tf-nightly...

### 🐛 Describe the bug Hi, I am trying to run the llama2 7B model on the [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct) dataset. As the title suggests, training with the hybrid_parallel or 3d plugin gives a None loss, but...

bug

**Describe the bug** I am following this [guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html) to fine-tune the llama2-7B model on 2 nodes (H100). My training hangs at the dataloader sanity check. ``` [NeMo I 2024-05-08 21:28:45 modelPT:724] Optimizer...

bug

**Describe the bug** https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html Docker image: nvcr.io/nvidia/nemo:24.01.01.framework I converted the llama2-70B HF model to NeMo using the above doc; the converted model is 129GB in size. I have 1.2T of disk space. While running on...

bug
stale