Transformers4Rec
[QST] - Multi-GPU Support w/ Naive model.fit()
❓ Questions & Help
Details
Hello! My team is using T4Rec w/ a BinaryClassificationTask head. We are utilizing the recommended class method model.fit() rather than the Trainer class.
We are currently running experiments on various AWS instances (4xV100 / 8xV100) & it appears that the model is training on a single GPU. Should the naive model.fit() have distributed training out of the box?
Thanks in advance!
CLI output below during training:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 62C P0 88W / 300W | 2132MiB / 16160MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 45C P0 39W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 42C P0 44W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 45C P0 40W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Hi @jacobdineen. We currently have two ways of training models with the Transformers4Rec PyTorch API:
- Trainer.train()
- Model.fit()
The 2nd approach was created to make the PyTorch API align with the TF API (like Keras model.fit()). But for the PyTorch API it is just a simple train loop, as you can see here, which does not support multi-GPU out of the box.
Trainer.train() is the most comprehensive training method, as we inherit our Trainer from the HuggingFace Trainer and use its train() method, which leverages Torch's DataParallel and DistributedDataParallel options for multi-GPU training, as mentioned in this doc.
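For reference, a rough sketch of what using the Trainer looks like (the argument names follow the T4Rec examples and may differ across versions; the paths and hyperparameters below are placeholders):
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

# Illustrative arguments; the HF Trainer wraps the model in DataParallel
# automatically when more than one GPU is visible to the process.
training_args = T4RecTrainingArguments(
    output_dir="./checkpoints",
    max_sequence_length=20,
    per_device_train_batch_size=128,
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=training_args, schema=schema, compute_metrics=True)
trainer.train_dataset_or_path = train_paths  # list of parquet files
trainer.train()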
So for multi-GPU data-parallel training I suggest either using our Trainer.train() or creating your own custom train loop based on that method, wrapping the model with DataParallel or DistributedDataParallel.
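A minimal sketch of such a custom loop, assuming the dataloader yields (inputs, targets) batches and that the model's forward pass returns probabilities for the binary task (the optimizer and loss below are illustrative choices, not what Model.fit() uses internally):
import torch
import torch.nn as nn

# Wrap the T4Rec model so each forward pass is split across the visible GPUs.
wrapped_model = nn.DataParallel(model.cuda())
optimizer = torch.optim.Adam(wrapped_model.parameters(), lr=1e-3)
criterion = nn.BCELoss()  # assumes the model outputs probabilities

wrapped_model.train()
for epoch in range(num_epochs):
    for x, y in train_loader:      # assumes (inputs_dict, target) batches
        optimizer.zero_grad()
        preds = wrapped_model(x)   # scattered and gathered by DataParallel
        loss = criterion(preds.reshape(-1), y.float().reshape(-1).to(preds.device))
        loss.backward()
        optimizer.step()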
Thank you, @gabrielspmoreira! You can close this issue. My team and I will try to write a custom train loop for the keras-style PyTorch API.
@gabrielspmoreira Is there any chance that this will be added as a feature in later releases?
Hi @jacobdineen. Can you be a little more specific about what you are looking for? Do you need multi-GPU training but prefer a simpler training loop (like we have for Model.fit()) rather than using the Trainer class we inherit from the HF Trainer?
Apologies for not being clear. I'll send over some additional context tomorrow.
@gabrielspmoreira For additional context -
Model.fit() is currently the preferred way to use the package for tasks involving BinaryClassification. When using the Trainer class for BinaryClassification, the package uses some form of unsupervised training (like MLM/PLM) to construct the target feature on the fly, ignoring the explicit target.
Using Model.fit(), as you've noted, is a viable workaround for this problem when the end user has an explicit target variable. However, there appears to be some disharmony between the way this method/class works (Keras-style) and what is compatible with wrapping a model object in DataParallel or DistributedDataParallel.
Basically, the fit method outlined above no longer exists once the model is wrapped in DataParallel. It is still reachable via wrapped_model.module.fit, but that call is no longer wrapped for distributed execution: it creates a single model replica and batches to a single GPU.
This only seems to be an issue for the backward pass. The Model class inherits from torch.nn.Module, so wrapping the object in DataParallel works for evaluation, since it invokes torch.nn.Module.__call__.
Let me know if I can clarify any further.
@jacobdineen thanks for the additional context. I guess your answer to Gabriel's question ("Do you need multi-GPU training but prefer a simpler training loop (like we have for Model.fit()) rather than using the Trainer class we inherit from HF Trainer?") is a yes, isn't it?
@rnyak Correct; ideally the simpler training loop would support multi-GPU distributed training.
My team and I will try to write a custom train loop for the keras-style PyTorch API.
@jacobdineen what's the current status of your custom work? I understand DistributedDataParallel did not work. Did you test DataParallel, and did you gain any training runtime improvement with it? Thanks.
@rnyak A member of the team is going to test out the solutions from this issue in the next day or so.
As for what we have tried, assume the following setup:
import glob
import os

# type: transformers4rec.torch.model.base.Model
model = transformer_config.to_torch_model(input_module, prediction_task)
model = model.cuda()
train_path = glob.glob(os.path.join(output_dir, "1/train.parquet"))
train_loader = next(iter(get_dataloader(schema, train_path)))
The workflow we currently follow is incremental (as the tutorials are):
for time_index in range(start_time_window_index, final_time_window_index):
    # fetch paths and dataloaders here
    ...........
    model.fit(train_loader,
              optimizer=optimizer,
              num_epochs=num_epochs,
              amp=False,
              verbose=False)
    # custom model evaluate code here
The issue with torch.nn.DataParallel is that we cannot simply wrap the model and call fit, e.g.:
import torch.nn as nn
wrapped_model = nn.DataParallel(model)
wrapped_model.fit(data)
This throws an AttributeError: the DataParallel object has no attribute fit. As discussed above, the T4Rec Model class inherits from torch.nn.Module, so wrapping the object in DataParallel works for evaluation, since the forward pass invokes torch.nn.Module.__call__.
So something like this still works for the forward pass:
data = next(iter(get_dataloader(schema, train_path)))
x, y = data[0]
wrapped_model = nn.DataParallel(model)
wrapped_model(x)  # computes probas
But to my understanding (correct me if I'm wrong), accessing the T4Rec model's class methods via wrapped_model.module.fit() doesn't invoke DataParallel at all.
This is a pretty good example of the issue
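In the meantime, a rough sketch of a DistributedDataParallel-based loop that avoids .module.fit() entirely (one process per GPU, e.g. launched with torchrun --nproc_per_node=4; the optimizer, loss, and per-rank data sharding below are assumptions, and get_dataloader is the same placeholder helper as above):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK and the rendezvous env vars.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = transformer_config.to_torch_model(input_module, prediction_task).cuda()
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()  # assumes the forward pass returns probabilities

# Each rank should load its own shard of the parquet files (sharding not shown).
train_loader = get_dataloader(schema, train_path)

ddp_model.train()
for x, y in train_loader:
    optimizer.zero_grad()
    preds = ddp_model(x)  # forward through the DDP wrapper, not .module
    loss = criterion(preds.reshape(-1), y.float().reshape(-1).to(preds.device))
    loss.backward()       # gradients are all-reduced across ranks here
    optimizer.step()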
This issue is going to be addressed by task #487
@rnyak , please close this issue after tagging the PRs.