bschifferer comments

Results: 108 comments of bschifferer

[INF] Documentation improvement

Docstring Coverage (March 28th):

- Merlin Models: 40%
- Transformers4Rec: 41%
- Merlin Systems: 80%
- Merlin Core: 80%
- DataLoader: 78%

Merlin Models:

============================ Coverage for /workspace/01_MerlinDev/62_DocStrings/models/merlin/ ============================
--------------------------------------------------------- Summary ---------------------------------------------------------
| Name |...

[Task] Convert multi-gpu-nvtabular to blog post

@radekosmulski provided a draft here: https://docs.google.com/document/u/1/d/1gr9WDyfZwxr7FNm2fv6S_b7_B35xOdomCpTFjw8ycGc/edit#heading=h.zdt8bivd436s

0.1.16 version creates import errors

I think the error is related to a version mismatch between nvtabular, transformers4rec and merlin models.

```
    196 try:
--> 197     from nvtabular.io.dataset import Dataset
    198 except ImportError:

ModuleNotFoundError:...
```

[BUG] Fixing Dataparallel Training

@edknv I don't think repartitioning is an option. If you have a 1 TB dataset, does that work? How long will the repartition take?

[BUG] Fixing Dataparallel Training

Does repartition guarantee equal-size datasets?

[BUG] Data parallel training freezes due to different number of batches

@jperez999 I provided an example with Merlin Models: https://github.com/NVIDIA-Merlin/models/pull/778

I added the `seed_fn`:

```
train_dl = tf_dataloader.BatchedDataset(
    train,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
    global_size=2,
    global_rank=hvd.rank(),
    seed_fn=seed_fn,
)
print(len(train_dl))
```

When...

[BUG] Data parallel training freezes due to different number of batches

Even with a single file, I can get a different number of batches. The `part_size` parameter controls the number of partitions and therefore changes the number of batches.

```
import...
```
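To illustrate why the partition layout can change the per-worker batch count, here is a minimal pure-Python sketch. It does not use the Merlin dataloader. It assumes, purely for illustration, that partitions are dealt round-robin to workers and that each worker drops its last incomplete batch; the real loader may assign partitions differently. The helpers `split_rows` and `batches_per_rank` and the `part_rows` argument (rows per partition) are hypothetical names, not part of any Merlin API.

```python
import math


def split_rows(total_rows, part_rows):
    """Split total_rows into partitions holding at most part_rows rows each."""
    full, rest = divmod(total_rows, part_rows)
    return [part_rows] * full + ([rest] if rest else [])


def batches_per_rank(partition_rows, world_size, batch_size, drop_last=True):
    """Count the batches each worker sees.

    Assumption: partition i goes to worker i % world_size (round-robin).
    """
    counts = []
    for rank in range(world_size):
        rows = sum(partition_rows[rank::world_size])
        if drop_last:
            counts.append(rows // batch_size)
        else:
            counts.append(math.ceil(rows / batch_size))
    return counts


total = 100_000_000
for part_rows in (10_000_000, 30_000_000):
    parts = split_rows(total, part_rows)
    counts = batches_per_rank(parts, world_size=2, batch_size=1024)
    print(f"{len(parts)} partitions -> batches per worker: {counts}")
```

Under these assumptions, ten equal partitions give both workers the same number of batches, while four partitions (three full, one smaller) leave the two workers with different counts. That kind of mismatch is what makes a collective operation wait for a step that one worker never reaches.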

[BUG] Data parallel training freezes due to different number of batches

Let's repartition the dataset, following the approach [here](https://github.com/NVIDIA-Merlin/models/blob/d4453cb599ef7ace289da758dff2c0ce11e69700/tests/unit/tf/horovod/test_horovod.py#L47-L52):

```
import cudf
import os
import numpy as np
from merlin.loader.tensorflow import Loader
import nvtabular as nvt

df = cudf.DataFrame({
    'col1': np.random.randint(0,1,size=100_000_000)
})
...
```

[BUG] Data parallel training freezes due to different number of batches

Having multiple files:

```
import cudf
import os
import numpy as np
from merlin.loader.tensorflow import Loader
import nvtabular as nvt

df = cudf.DataFrame({
    'col1': np.random.randint(0,1,size=100_000_000)
})
df.to_parquet('single2_1.parquet')
df = cudf.DataFrame({...
```

[BUG] Data parallel training freezes due to different number of batches

Having multiple files with repartition:

```
import cudf
import os
import numpy as np
from merlin.loader.tensorflow import Loader
import nvtabular as nvt

df = cudf.DataFrame({
    'col1': np.random.randint(0,1,size=100_000_000)
})
df.to_parquet('single2_1.parquet')
df...
```