Edward Kim comments

Results: 29 comments of Edward Kim

[BUG] UserWarning: You have more processes(4) than dataset [1,1]<stderr>: partitions(1), reduce the number of processes.

@ssubbayya Can you please share more information on how you arrived at that warning? A minimal reproducible example would be great. I'm particularly confused because you are using the `merlin-pytorch`...

[BUG] Update Training and Serving Merlin on AWS SageMaker using latest merlin image

I'm working on updating the `merlin-tensorflow` image to `23.06` here: https://github.com/NVIDIA-Merlin/Merlin/pull/1040. After bumping the image version to `23.06` and updating the processing workflow in `train.py` to reflect recent changes, and...

[BUG] Update Training and Serving Merlin on AWS SageMaker using latest merlin image

@wei-m-teh Apologies for the delay. It's in review at the moment, but I updated #1040 with a workaround I found for making the notebook work with the latest `23.08` image.

[BUG] Fixing Dataparallel Training

@bschifferer Why did we decide to split the dataset in the multi-gpu example instead of [repartitioning the dataset](https://github.com/NVIDIA-Merlin/models/blob/d4453cb599ef7ace289da758dff2c0ce11e69700/tests/unit/tf/horovod/test_horovod.py#L47-L52) with `row_group_size`? Did repartitioning not work?

[BUG] Fixing Dataparallel Training

> I don't think repartitioning is an option. If you have a 1TB dataset, does that work? How long will repartition take? We are using dask dataframes so it's out of...

[BUG] Data parallel training freezes due to different number of batches

@jperez999 Is there a way to produce an equal number of batches so that the workload is balanced across workers? Although NVTabular seems to produce equal-sized batches in [tf_trainer.py](https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/multi-gpu-movielens/tf_trainer.py), the number...
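
To illustrate the imbalance this question is about, here is a minimal sketch in plain Python. It is not how NVTabular or Merlin handle batching; the function names and row counts are hypothetical and chosen only for illustration. It shows that equal-sized batches can still yield a different number of batches per worker when the shards hold different numbers of rows, and that capping every worker at the smallest count is one simple way to keep the workers in step.

```python
import math


def batches_per_worker(partition_rows, batch_size):
    """Return how many batches each worker would yield from its shard."""
    return [math.ceil(rows / batch_size) for rows in partition_rows]


def common_steps(partition_rows, batch_size):
    """Return the number of steps every worker can run before any runs out."""
    return min(batches_per_worker(partition_rows, batch_size))


# Hypothetical row counts for the shards of four workers.
rows = [1000, 1000, 1000, 700]

counts = batches_per_worker(rows, batch_size=256)
steps = common_steps(rows, batch_size=256)

print(counts)  # the last worker has fewer batches than the others
print(steps)   # cap each worker's loop at this many steps
```

Capping at the minimum drops a few trailing batches on the larger shards. Whether that trade-off is acceptable depends on the training setup.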

GPU memory does not get freed up properly after each batch

Shuffle doesn't work

Shuffle doesn't work

Device assignment does not work in PyTorch