Transformers4Rec
[QST] - Multi-GPU Support w/ Naive model.fit()
❓ Questions & Help
Details
Hello! My team is using T4Rec w/ a BinaryClassificationTask head. We are utilizing the recommended class method model.fit() rather than the Trainer class.
We are currently running experiments on various AWS instances (4xV100 / 8xV100) & it appears that the model is training on a single GPU. Should the naive model.fit() have distributed training out of the box?
Thanks in advance!
CLI output below during training:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 62C P0 88W / 300W | 2132MiB / 16160MiB | 7% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 45C P0 39W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 42C P0 44W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 45C P0 40W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Hi @jacobdineen. We currently have two ways of training models with the Transformers4Rec PyTorch API:
- Trainer.train()
- Model.fit()
The 2nd approach was created to make the PyTorch API align with the TF API (like Keras model.fit()). But for the PyTorch API it is just a simple train loop, as you can see here, which does not support multi-GPU out of the box.
Trainer.train() is the most comprehensive training method, as we inherit our Trainer from the HuggingFace Trainer and use its train() method, which leverages Torch's DataParallel and DistributedDataParallel options for multi-GPU training, as mentioned in this doc.
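For reference, a rough sketch of what using the Trainer looks like (the argument names follow the T4Rec examples and may differ across versions; the paths and hyperparameters below are placeholders):
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

# Illustrative arguments; the HF Trainer wraps the model in DataParallel
# automatically when more than one GPU is visible to the process.
training_args = T4RecTrainingArguments(
    output_dir="./checkpoints",
    max_sequence_length=20,
    per_device_train_batch_size=128,
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=training_args, schema=schema, compute_metrics=True)
trainer.train_dataset_or_path = train_paths  # list of parquet files
trainer.train()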
So for multi-GPU data-parallel training I suggest either using our Trainer.train() or creating your own custom train loop based on that method, wrapping the model with DataParallel or DistributedDataParallel.
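A minimal sketch of such a custom loop, assuming the dataloader yields (inputs, targets) batches and that the model's forward pass returns probabilities for the binary task (the optimizer and loss below are illustrative choices, not what Model.fit() uses internally):
import torch
import torch.nn as nn

# Wrap the T4Rec model so each forward pass is split across the visible GPUs.
wrapped_model = nn.DataParallel(model.cuda())
optimizer = torch.optim.Adam(wrapped_model.parameters(), lr=1e-3)
criterion = nn.BCELoss()  # assumes the model outputs probabilities

wrapped_model.train()
for epoch in range(num_epochs):
    for x, y in train_loader:      # assumes (inputs_dict, target) batches
        optimizer.zero_grad()
        preds = wrapped_model(x)   # scattered and gathered by DataParallel
        loss = criterion(preds.reshape(-1), y.float().reshape(-1).to(preds.device))
        loss.backward()
        optimizer.step()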
Thank you, @gabrielspmoreira! You can close this issue. My team and I will try to write a custom train loop for the keras-style PyTorch API.
@gabrielspmoreira Is there any chance that this will be added as a feature in later releases?
Hi @jacobdineen. Can you be a little more specific about what you are looking for? Do you need multi-GPU training but prefer a simpler training loop (like we have for Model.fit()) rather than using the Trainer class we inherit from the HF Trainer?
Apologies for not being clear. I'll send over some additional context tomorrow.
@gabrielspmoreira For additional context -
Model.fit() is currently the preferred way to use the package for tasks involving BinaryClassification. When using the Trainer class for BinaryClassification, the package uses some form of unsupervised training (like MLM/PLM) to construct the target feature on the fly, ignoring the explicit target.
Using Model.fit(), as you've noted, is a viable workaround for this problem when the end user has an explicit target variable. However, there appears to be some disharmony between the way this method/class works (Keras-style) and what is compatible with wrapping a model object in DataParallel or DistributedDataParallel.
Basically, the fit method outlined above no longer exists once the model is wrapped in DataParallel. It is still reachable via wrapped_model.module.fit, but that call is no longer wrapped for distributed execution: it creates a single model replica and batches to a single GPU.
This only seems to be an issue for the backward pass. The Model class inherits from torch.nn.Module, so wrapping the object in DataParallel works for evaluation, since it invokes torch.nn.Module.__call__.
Let me know if I can clarify any further.
@jacobdineen thanks for the additional context. I guess your answer to Gabriel's question ("Do you need multi-GPU training but prefer a simpler training loop (like we have for Model.fit()) rather than using the Trainer class we inherit from HF Trainer?") is a yes, isn't it?
@rnyak Correct; ideally the simpler training loop would support multi-GPU distributed training.
My team and I will try to write a custom train loop for the keras-style PyTorch API.
@jacobdineen what's the current status of your custom work? I understand DistributedDataParallel did not work. Did you test DataParallel, and did you gain any training runtime improvement with it? Thanks.
@rnyak A member of the team is going to test out the solutions from this issue in the next day or so.
As for what we have tried, assume the following setup:
import glob
import os

# type: transformers4rec.torch.model.base.Model
model = transformer_config.to_torch_model(input_module, prediction_task)
model = model.cuda()
train_path = glob.glob(os.path.join(output_dir, "1/train.parquet"))
train_loader = next(iter(get_dataloader(schema, train_path)))
The workflow we currently follow is incremental (as the tutorials are):
for time_index in range(start_time_window_index, final_time_window_index):
    # fetch paths and dataloaders here
    ...........
    model.fit(train_loader,
              optimizer=optimizer,
              num_epochs=num_epochs,
              amp=False,
              verbose=False)
    # custom model evaluate code here
The issue with torch.nn.DataParallel is that we cannot simply wrap the model and call fit, e.g.:
import torch.nn as nn
wrapped_model = nn.DataParallel(model)
wrapped_model.fit(data)
This throws an AttributeError: the DataParallel object has no attribute fit. As discussed above, the T4Rec Model class inherits from torch.nn.Module, so wrapping the object in DataParallel works for evaluation, since the forward pass invokes torch.nn.Module.__call__.
So something like this still works for the forward pass:
data = next(iter(get_dataloader(schema, train_path)))
x, y = data[0]
wrapped_model = nn.DataParallel(model)
wrapped_model(x)  # computes probas
But to my understanding (correct me if I'm wrong), accessing the T4Rec model's class methods via wrapped_model.module.fit() doesn't invoke DataParallel at all.
This is a pretty good example of the issue
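In the meantime, a rough sketch of a DistributedDataParallel-based loop that avoids .module.fit() entirely (one process per GPU, e.g. launched with torchrun --nproc_per_node=4; the optimizer, loss, and per-rank data sharding below are assumptions, and get_dataloader is the same placeholder helper as above):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK and the rendezvous env vars.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = transformer_config.to_torch_model(input_module, prediction_task).cuda()
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()  # assumes the forward pass returns probabilities

# Each rank should load its own shard of the parquet files (sharding not shown).
train_loader = get_dataloader(schema, train_path)

ddp_model.train()
for x, y in train_loader:
    optimizer.zero_grad()
    preds = ddp_model(x)  # forward through the DDP wrapper, not .module
    loss = criterion(preds.reshape(-1), y.float().reshape(-1).to(preds.device))
    loss.backward()       # gradients are all-reduced across ranks here
    optimizer.step()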
This issue is going to be addressed by task #487
@rnyak , please close this issue after tagging the PRs.