darts
GPU Optimization with Num_Workers not working
I am not very experienced, but I love this package. However, my GPU only shows about 1% utilisation during training. Increasing the batch size made my predictions far less accurate, and I read that increasing num_loader_workers should help, but then I get a log message telling me to set persistent_workers=True in the val_dataloader, which I don't think can be set through Darts, and the model takes about 5 times longer to run. Can you please assist? I just got a better GPU to speed up my training, but I can't get it to use more of the GPU. Here is my model for reference:
import torch
from darts.models import NHiTSModel

NHiTS_Model = NHiTSModel(
    model_name="Nhits_run",
    input_chunk_length=input_length_chunk,
    output_chunk_length=forecasting_horizon,
    num_stacks=number_stacks,
    num_blocks=number_blocks,
    num_layers=number_layers,
    layer_widths=lay_widths,
    n_epochs=number_epochs,
    nr_epochs_val_period=number_epochs_val_period,
    batch_size=batch_size,
    dropout=dropout_rate,
    force_reset=True,
    save_checkpoints=True,
    optimizer_cls=torch.optim.AdamW,
    loss_fn=torch.nn.HuberLoss(),
    random_state=rand_state,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": [0],
    },
)
NHiTS_Model.fit(
    series=train,
    past_covariates=train_cov,
    verbose=True,
    val_series=val,
    val_past_covariates=val_cov,
    num_loader_workers=1,
)
Oh, and the newer GPU and the much weaker one train for the same length of time, so there is a bottleneck somewhere.
Hi @Laenita,
Would you mind sharing the values of the hyperparameters, so that we can get an idea of the number of parameters / size of the model?
Is the GPU utilisation at 1% for both the old and the new device?
The pl_trainer_kwargs argument looks good; this is what PyTorch Lightning expects to enable GPU acceleration. I would recommend looking up their documentation, as this is what Darts relies on for the deep learning models.
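For context, Darts forwards pl_trainer_kwargs to the underlying PyTorch Lightning Trainer. As a minimal sketch (illustrative only, not the actual Darts internals), the dict in the snippet above corresponds roughly to building a trainer like this:

import pytorch_lightning as pl

# Darts constructs a Lightning Trainer from pl_trainer_kwargs; the dict above
# maps roughly onto these Trainer arguments.
trainer = pl.Trainer(
    accelerator="gpu",  # run on the GPU backend
    devices=[0],        # use the first CUDA device
)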
Hi @madtoinou
Of course, here are the parameters for my model, I hope this helps:
input_length_chunk = 20
forecasting_horizon = 3
number_stacks = 4
number_blocks = 5
number_layers = 5
batch_size = 64
dropout_rate = 0.1
number_epochs = 180
number_epochs_val_period = 1
And yes, both the old and the newer (and much faster) GPUs show only 1% utilisation and train the same model in the same amount of time, which indicates that something is wrong and the GPU is heavily under-utilised.
But also, num_loader_workers is not working at all for me: training takes more than an hour with num_loader_workers > 0.
Thanks for your assistance!
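For reference, one quick way to answer the model-size question is to count the trainable parameters of the fitted model. This is a rough sketch; it assumes the trained Darts model exposes its underlying PyTorch module as NHiTS_Model.model, which may differ between Darts versions:

# Count trainable parameters of the underlying PyTorch module to gauge model size.
# Assumes the fitted Darts model exposes it as `.model` (may vary across versions).
n_params = sum(p.numel() for p in NHiTS_Model.model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params:,}")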
Yes, I have the same problem: I am told that num_loader_workers is not a legit parameter.
Hi @igorrivin & @Laenita,
As mentioned in another thread, PR #2295 adds support for those arguments. Maybe try installing that branch / copying the changes and see if it solves the bottleneck?
Hi @madtoinou
I have copied the changes from PR https://github.com/unit8co/darts/pull/2295, but now whenever I add persistent_workers=True and num_loader_workers=16 (or even just 1), it gets stuck on sanity checking. Did I maybe miss anything? Thank you for your assistance!
Which sanity checking are you referring to?
Hi @madtoinou, the best explanation I can give is this screenshot, where the model first goes into a sanity-checking phase before starting training:
Hi @Laenita,
Is the problem still occurring?
The sanity check is a mechanism implemented by PyTorch Lightning (see here); you could try to disable it by passing pl_trainer_kwargs={"num_sanity_val_steps": 0}.
Since #2295 has been merged, you could try again with dataloader_kwargs={"persistent_workers": True, "num_workers": 1} in fit().
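Putting both suggestions together, here is a sketch of what the updated calls could look like (the keyword names assume the API added in #2295 and PyTorch's standard DataLoader arguments; the worker count is just an example to tune to your CPU):

NHiTS_Model = NHiTSModel(
    input_chunk_length=input_length_chunk,
    output_chunk_length=forecasting_horizon,
    # ... other hyperparameters as in the original post ...
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": [0],
        "num_sanity_val_steps": 0,  # skip Lightning's sanity-check pass
    },
)
NHiTS_Model.fit(
    series=train,
    past_covariates=train_cov,
    val_series=val,
    val_past_covariates=val_cov,
    dataloader_kwargs={"persistent_workers": True, "num_workers": 4},
)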
Also, does the GPU utilization increase if you increase the size of the model? And when you change the size of the batch? Or is it always 1%?
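As a rough way to check this, you can spot-check utilisation from Python while the model trains in another process; this sketch assumes the pynvml package is installed, which torch.cuda.utilization relies on:

import torch

# Spot-check GPU usage; torch.cuda.utilization() requires pynvml to be installed.
if torch.cuda.is_available():
    print(f"GPU utilization: {torch.cuda.utilization(0)}%")
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")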