Transformers4Rec

Benchmark the scalability of the new PyT data loader (with sparse tensor support) on multi-GPU and larger datasets

Open gabrielspmoreira opened this issue 4 years ago • 5 comments

Benchmark the new PyT data loader with the REES46 ecommerce dataset, using multiple GPUs

Train set: all train.parquet files for the 31 days (one parquet file per week); set the row group size accordingly.
Eval set: all valid.parquet files concatenated.
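For reference, a minimal sketch (not part of this issue's code; the paths and the 100k value are assumptions) of controlling a parquet file's row group size with pyarrow:

import pyarrow.parquet as pq

# Rewrite a parquet file with an explicit row group size so the dataloader
# can read partial files efficiently; tune the value to your batch size.
table = pq.read_table("train.parquet")
pq.write_table(
    table,
    "train_regrouped.parquet",
    row_group_size=100_000,  # assumed value, rows per row group
)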

  • [ ] Create a recsys_main.py variation for non-incremental training
  • [ ] Train with 3 weeks and evaluate on the last week
  • [ ] Run experiments varying the number of GPUs: single GPU, multi-GPU DataParallel, and multi-GPU DistributedDataParallel (a minimal wrapping sketch follows this list)
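A minimal sketch of the three parallelism modes, assuming a placeholder model and a hypothetical PARALLEL_MODE switch (this is not the recsys_main.py code):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DataParallel, DistributedDataParallel

def build_model():
    # Placeholder for the actual session-based Transformer model.
    return nn.Linear(128, 128)

mode = os.environ.get("PARALLEL_MODE", "single")  # hypothetical switch, not a real flag

if mode == "single":
    # Single GPU: no wrapping needed.
    model = build_model().cuda()
elif mode == "dp":
    # Multi-GPU DataParallel: one process, each batch split across the visible GPUs.
    model = DataParallel(build_model().cuda())
elif mode == "ddp":
    # Multi-GPU DistributedDataParallel: one process per GPU, launched e.g. with
    #   torchrun --nproc_per_node=<num_gpus> recsys_main.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = DistributedDataParallel(build_model().cuda(), device_ids=[local_rank])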

gabrielspmoreira · Jun 08 '21 14:06

Gabriel and I did a debugging session and found out that the problem with the distributed model happens between 50-70% of training on the first parquet file (day) when these args are set in our dataloader:

NVTDataLoader(
    ...,
    global_size=global_size,
    global_rank=global_rank,
    ...,
)

When these arguments are disabled, we can train on two GPUs (but both using the same dataset). So most likely the issue is in our NVT PyT dataloader.

We can reproduce it quickly with ecom_small.
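A toy sketch of the suspected failure mode (an assumption, not verified against the NVTDataLoader internals): strided per-rank sharding can leave ranks with different batch counts, so one DDP worker finishes the file early while the other blocks waiting for a gradient all-reduce.

# Toy illustration: row counts and batch size are made up for demonstration.
num_rows = 1025
global_size = 2
batch_size = 512

for global_rank in range(global_size):
    shard_rows = len(range(global_rank, num_rows, global_size))  # strided shard
    num_batches = -(-shard_rows // batch_size)  # ceiling division
    print(f"rank {global_rank}: {shard_rows} rows -> {num_batches} batches")
# rank 0 gets 2 batches while rank 1 gets 1, so the ranks fall out of sync.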

rnyak · Aug 12 '21 21:08

Gabriel, Julio and I did another debugging session, and it looks like one of our workers is not waiting for the other worker, which creates a bottleneck. The options/guidance to explore (see the barrier sketch after this list):

  • [ ] torch.distributed.barrier()
  • [ ] https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
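A minimal sketch of the barrier option, assuming a hypothetical per-day training function (not the actual recsys_main.py code):

import torch.distributed as dist

def train_one_file(model, path):
    # Placeholder for one day's training loop over a single parquet file.
    pass

def train_all_days(parquet_files, model):
    for path in parquet_files:
        train_one_file(model, path)
        if dist.is_initialized():
            dist.barrier()  # wait until every rank has finished this day's file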

rnyak · Aug 31 '21 15:08

@rnyak Does the library currently support training in a multi-GPU configuration? Even though I have multiple GPUs, training happens on only one of them and is not parallelized across both. Is there a way to add this to our trainer?

Ahanmr · Mar 03 '22 12:03

@Ahanmr We are currently working on supporting multi-GPU training.

rnyak · Oct 03 '22 18:10

@rnyak After preparing a custom dataset, I have 1321 folders, one for each day, so I'm training as described in the yoochoose dataset example. I have a few questions regarding that:

  1. Currently I'm training with a train batch size of 32 and an eval batch size of 16, since I have 16 GB of GPU memory. I'm not sure what values would be better given my resources, so any suggestion on that would be helpful.
  2. After about 500 days of training, the loss is 0. I'm not sure whether this is overfitting or expected behavior. Should I stop, or is there a better way to do the training?
  3. Also, the evaluation scores for each day are very low, so how does one obtain a final evaluation score?

Any help would be great. Thanks!
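For question 3, one simple aggregation would be to average the per-day metric dicts; a rough sketch (the metric names and structure are assumed, not Transformers4Rec output):

from collections import defaultdict

def aggregate_daily_metrics(daily_metrics):
    """daily_metrics: list of dicts like {'recall@20': 0.11, 'ndcg@20': 0.06}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for day in daily_metrics:
        for name, value in day.items():
            sums[name] += value
            counts[name] += 1
    # Final score per metric = mean over all evaluated days.
    return {name: sums[name] / counts[name] for name in sums}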

alan-ai-learner · Feb 22 '23 12:02