
multiVI fails with invalid tensor


Hi scvi team, thanks for the excellent tools! I'm running into an issue with MultiVI on scvi-tools versions newer than 0.20.3; the error is the same as the one reported here and here.

I set up two GitHub Actions to reproduce the error: the same script with the same toy dataset, changing only the scvi-tools version in the environment (you can see all the dependencies in the conda list step). The actions run on Ubuntu, but I see the same issue on an Intel-chip Mac.

This action, with scvi-tools==1.1.1, fails with the error described above: https://github.com/DendrouLab/panpipes/actions/runs/8114022415/job/22178710942?pr=201

This action, with scvi-tools==0.20.3, runs successfully: https://github.com/DendrouLab/panpipes/actions/runs/8114022411/job/22178710284?pr=201

Thanks for your help!

bio-la avatar Mar 01 '24 16:03 bio-la

Hi, the errors you are referring to are about MPS on Apple hardware. We have seen the issue you are facing in MultiVI in newer scvi-tools versions. The main reason, in our hands, is that we changed the default seed in newer scvi-tools versions. Do you fix the seed and still face the problem? Can you share the toy dataset (how large is it, in number of cells and file size)? It would be interesting to explore which updates in the TrainingPlan remove these errors.
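
(For reference, a minimal sketch of fixing the seed globally; the value 0 is an arbitrary choice, not a recommendation:)

    import scvi

    # Fix the global random seed before setting up and training the model
    # (0 is an arbitrary value used here for illustration).
    scvi.settings.seed = 0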

canergen avatar Mar 01 '24 16:03 canergen

Thanks for your answer! Here is the toy dataset (approx. 2000 cells by 4000 features).

I didn't include a seed, but I will try that and let you know.

bio-la avatar Mar 01 '24 16:03 bio-la

Unfortunately, no success with an explicit scvi.settings.seed: https://github.com/DendrouLab/panpipes/actions/runs/8114593332/job/22180588924?pr=201

bio-la avatar Mar 01 '24 17:03 bio-la

Hi @bio-la, sorry you're running into this. It looks like your issue might be slightly different from the Discourse threads you linked, since the CI is running on Ubuntu rather than macOS, so I'm guessing it is unrelated to the PyTorch MPS build.

Could you try passing in a lower learning rate (maybe lr=1e-5 or lr=1e-6) and see if that helps? Also, is this error occurring in the first epoch of training or later on? I wasn't able to find that info in the logs.
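
(A minimal sketch of what passing a lower learning rate could look like, assuming an already set-up MULTIVI model object named mvi:)

    # Assumes `mvi` is an already set-up scvi.model.MULTIVI instance.
    # Try a lower learning rate than the default to check whether the
    # failure is caused by numerical instability during optimization.
    mvi.train(lr=1e-5)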

martinkim0 avatar Mar 01 '24 17:03 martinkim0

Thanks for your suggestions. Here are my comments:

  • I get the same error message on Ubuntu and on an Intel-chip Mac. If you speculate that the cause in my case is not the same, I don't know why the effect would be identical to what is seen on M3 chips.
  • The error appears at the first epoch of training.
  • Changing lr from 1e-3 to 1e-5 doesn't solve the issue; scvi-tools 0.20.3 still works with the lower lr.
  • I noticed that the conda install is now pulling scvi-tools 1.1.2, which still fails.

I'm using this dataset with the following parameters for MultiVI:

MultiVI:
    batch_covariate: dataset
    model_args:
      n_hidden: None
      n_latent: None
      # bool (default: True)
      region_factors: True
      # {'normal', 'ln'} (default: 'normal')
      latent_distribution: 'normal'
      # bool (default: False)
      deeply_inject_covariates: False
      # bool (default: False)
      fully_paired: False
    training_args:
      # (default: 500)
      max_epochs: 500
      # float (default: 0.0001)
      lr: 1.0e-05
      # leave blank for default; str | int | bool | None (default: None)
      use_gpu:
      # float (default: 0.9)
      train_size: 0.9
      # leave blank for default; float | None (default: None)
      validation_size:
      # int (default: 128)
      batch_size: 128
      # float (default: 0.001)
      weight_decay: 0.001
      # float (default: 1.0e-08)
      eps: 1.0e-08
      # bool (default: True)
      early_stopping: True
      # bool (default: True)
      save_best: True
      # leave blank for default; int | None (default: None)
      check_val_every_n_epoch:
      # leave blank for default; int | None (default: None)
      n_steps_kl_warmup:
      # int | None (default: 50)
      n_epochs_kl_warmup: 50
      # bool (default: True)
      adversarial_mixing: True
    # leave blank for default; dict | None (default: None)
    training_plan: None
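
(For context, a hedged sketch of how the training_args above would map onto a MULTIVI training call; mvi is an illustrative, already set-up model object, and the actual wiring lives inside panpipes:)

    # Illustrative only: forwarding the training_args above to MULTIVI.train().
    # `mvi` is assumed to be an already set-up scvi.model.MULTIVI instance.
    mvi.train(
        max_epochs=500,
        lr=1e-5,
        train_size=0.9,
        batch_size=128,
        weight_decay=0.001,
        eps=1e-08,
        early_stopping=True,
        save_best=True,
        n_epochs_kl_warmup=50,
        adversarial_mixing=True,
    )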

bio-la avatar Mar 04 '24 10:03 bio-la

Hi @martinkim0, any news on this? Thanks!

bio-la avatar Mar 11 '24 15:03 bio-la

We are on it. I have one solution (gradient clipping) that solves similar problems in totalVI (the AdversarialTrainingPlan is not stable). We first need to make sure it doesn't reduce quality in downstream tasks. I would assume adversarial_mixing=False will solve it for now in your tests, but it will reduce integration quality. Last question: can you tell me the AnnData version used to save the object above? I had issues opening the file in my testing environment (likely an outdated AnnData on my end).

canergen avatar Mar 24 '24 18:03 canergen

Hi @bio-la, currently we enable an adversarial classifier even if only a single batch is present in the dataset. This is a bug, and we will have a fix soon. To work around the error for now, you can pass mvi.train(adversarial_mixing=False). Please let us know if you are still facing the issue.
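
(A minimal sketch of the suggested workaround, again assuming an already set-up MULTIVI model object named mvi:)

    # Temporary workaround: disable the adversarial classifier during training.
    # Note this may reduce batch-integration quality, as mentioned above.
    mvi.train(adversarial_mixing=False)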

canergen avatar Mar 26 '24 07:03 canergen

Awesome! I will wait for the changes to be merged and let you know. Thank you!

bio-la avatar Mar 26 '24 07:03 bio-la

Fixed in #2914 and will be released with scvi-tools 1.2.

canergen avatar Jul 26 '24 04:07 canergen

Hi @canergen! Thank you for providing the fix. I would like to use MultiVI for my project; do you have any indication of when scvi-tools 1.2 will be released?

wlason avatar Aug 16 '24 10:08 wlason