
GPU Optimization with Num_Workers not working

Open Laenita opened this issue 9 months ago • 9 comments

I am not very experienced, but I love this package. However, training only seems to utilize about 1% of my GPU. Increasing the batch size made my predictions far less accurate, and I read that increasing num_loader_workers should help, but then I get a log message telling me to set persistent_workers=True in the val_dataloader, which I understand Darts does not expose, and the model runs about 5 times longer. Can you please assist? I just got a better GPU to speed up my training, but I can't get it to use more of the GPU. Here is my model for reference:

    import torch
    from darts.models import NHiTSModel

    NHiTS_Model = NHiTSModel(
        model_name="Nhits_run",
        input_chunk_length=input_length_chunk,
        output_chunk_length=forecasting_horizon,
        num_stacks=number_stacks,
        num_blocks=number_blocks,
        num_layers=number_layers,
        layer_widths=lay_widths,
        n_epochs=number_epochs,
        nr_epochs_val_period=number_epochs_val_period,
        batch_size=batch_size,
        dropout=dropout_rate,
        force_reset=True,
        save_checkpoints=True,
        optimizer_cls=torch.optim.AdamW,
        loss_fn=torch.nn.HuberLoss(),
        random_state=rand_state,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
        },
    )
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        verbose=True,
        val_series=val,
        val_past_covariates=val_cov,
        num_loader_workers=1,
    )

Laenita avatar Apr 26 '24 12:04 Laenita

Oh, and the newer GPU and the much weaker one train for the same length of time, so there is a bottleneck somewhere.

Laenita avatar Apr 26 '24 20:04 Laenita

Hi @Laenita,

Would you mind sharing the values of the parameters, so that we can get an idea of the number of parameters/size of the model?

Is the GPU utilization at 1% on both the old and the new device?

The pl_trainer_kwargs argument looks good; this is what PyTorch Lightning expects to enable GPU acceleration. I would recommend looking at their documentation, as this is what Darts relies on for the deep learning models.
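
For reference, a minimal sketch of how additional Lightning Trainer options can be forwarded through pl_trainer_kwargs; the callback and its settings are example values for illustration, not a confirmed fix for the GPU issue:

    # Sketch: pl_trainer_kwargs is handed to pytorch_lightning.Trainer, so any
    # Trainer argument can go here, e.g. a callback such as EarlyStopping.
    from pytorch_lightning.callbacks import EarlyStopping
    from darts.models import NHiTSModel

    early_stopper = EarlyStopping(monitor="val_loss", patience=5, mode="min")

    model = NHiTSModel(
        input_chunk_length=20,
        output_chunk_length=3,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
            "callbacks": [early_stopper],
        },
    )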

madtoinou avatar Apr 29 '24 09:04 madtoinou

Hi @madtoinou

Of course, here are the parameters of my model, I hope this helps:

    input_length_chunk = 20
    forecasting_horizon = 3
    number_stacks = 4
    number_blocks = 5
    number_layers = 5
    batch_size = 64
    dropout_rate = 0.1
    number_epochs = 180
    number_epochs_val_period = 1

And yes, both the old and the newer (and much faster) GPUs show only 1% utilisation and train for the same time on the same model, which indicates that something is wrong and the GPU is heavily under-utilised.

The num_loader_workers argument is also not working at all for me: training takes more than an hour with num_loader_workers > 0.

Thanks for your assistance!

Laenita avatar May 01 '24 21:05 Laenita

Yes, I have the same problem: I am told that num_loader_workers is not a legit parameter.

igorrivin avatar May 02 '24 11:05 igorrivin

Hi @igorrivin & @Laenita,

As mentioned in another thread, PR #2295 adds support for those arguments. Maybe try installing that branch / copying the changes and see if it solves the bottleneck?

madtoinou avatar May 03 '24 07:05 madtoinou

Hi @madtoinou

I have copied the changes from PR #2295 (https://github.com/unit8co/darts/pull/2295), but now whenever I add persistent_workers=True and num_loader_workers=16 (or even just 1), it gets stuck on sanity checking. Did I maybe miss anything? Thank you for your assistance!

Laenita avatar May 09 '24 08:05 Laenita

Which sanity checking are you referring to?

madtoinou avatar May 10 '24 06:05 madtoinou

Hi @madtoinou, the best explanation I can give is this PNG, where the model first goes into a "Sanity Checking" phase before training starts: [screenshot: Sanity Checking]

Laenita avatar May 22 '24 19:05 Laenita

Hi @Laenita,

Is the problem still occurring?

The sanity check is a mechanism implemented by PyTorch Lightning (see here); you could try to disable it by passing pl_trainer_kwargs={"num_sanity_val_steps": 0}.
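
For illustration, a minimal sketch of passing that option at model creation (the other model arguments are placeholder values from this thread):

    # Sketch: num_sanity_val_steps=0 makes Lightning skip the pre-training
    # "Sanity Checking" validation pass.
    from darts.models import NHiTSModel

    model = NHiTSModel(
        input_chunk_length=20,
        output_chunk_length=3,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
            "num_sanity_val_steps": 0,
        },
    )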

Since #2295 has been merged, you could try again with dataloader_kwargs={"persistent_workers": True, "num_workers": 1} in fit().
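
A hedged sketch of the corresponding fit() call, assuming dataloader_kwargs is forwarded to torch.utils.data.DataLoader so the standard DataLoader argument names apply (model, train, train_cov, val and val_cov are the objects from the original snippet; num_workers=4 is just an example value):

    # Sketch: forward DataLoader options through fit(); tune num_workers to
    # the number of available CPU cores.
    model.fit(
        series=train,
        past_covariates=train_cov,
        val_series=val,
        val_past_covariates=val_cov,
        dataloader_kwargs={
            "num_workers": 4,
            "persistent_workers": True,  # keep workers alive between epochs
        },
    )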

Also, does the GPU utilization increase when you increase the size of the model, or when you change the batch size? Or does it always stay at 1%?
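
One way to check this empirically (a rough sketch; the data variables and parameter values are assumptions from earlier in the thread) is to time short training runs while varying the batch size and watching GPU utilization, e.g. with `nvidia-smi -l 1` in another terminal:

    # Rough sketch: compare wall-clock training time for a few batch sizes.
    import time
    from darts.models import NHiTSModel

    for bs in (64, 256, 1024):  # example values
        model = NHiTSModel(
            input_chunk_length=20,
            output_chunk_length=3,
            batch_size=bs,
            n_epochs=2,  # a couple of epochs is enough for a timing comparison
            pl_trainer_kwargs={"accelerator": "gpu", "devices": [0]},
        )
        start = time.time()
        model.fit(series=train, past_covariates=train_cov)
        print(f"batch_size={bs}: {time.time() - start:.1f}s")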

madtoinou avatar Aug 28 '24 07:08 madtoinou