MultiVI fails with an invalid tensor error
Hi scvi team, thanks for the excellent tools! I'm reaching out with an issue running MultiVI with scvi-tools versions newer than 0.20.3; the error is the same as reported here and here.
I set up two GitHub Actions to reproduce the error: I run the same script on the same toy dataset, only changing the scvi-tools version in the environment (you can see all the dependencies in the conda list step). The actions run on Ubuntu, but I see the same issue on an Intel-chip Mac.
This action, with scvi-tools==1.1.1, fails with the same error as described above: https://github.com/DendrouLab/panpipes/actions/runs/8114022415/job/22178710942?pr=201
This action, with scvi-tools==0.20.3, runs successfully: https://github.com/DendrouLab/panpipes/actions/runs/8114022411/job/22178710284?pr=201
Thanks for your help!
Hi, the errors you are referring to are issues with MPS on Apple hardware. We have seen the issue you are facing with MultiVI in newer scvi-tools versions. The main reason, in our hands, is that we changed the default seed in newer scvi-tools versions. Do you still face the problem if you fix the seed? Can you share the toy dataset (how large is it, in number of cells and file size)? It would be interesting to explore which updates in the TrainingPlan remove these errors.
Thanks for your answer! Here is the toy dataset (approx. 2000 cells by 4000 features).
I didn't include a seed, but I will try that and let you know.
Unfortunately, no success with an explicit scvi.settings.seed: https://github.com/DendrouLab/panpipes/actions/runs/8114593332/job/22180588924?pr=201
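For reference, the seed is set roughly like this (a minimal sketch; the file name is a placeholder and the real pipeline assembles the multiome AnnData upstream):

```python
import scvi
import anndata as ad

# Explicit, fixed seed set before anything else touches the model or the data loaders.
scvi.settings.seed = 0

# Placeholder load; the real pipeline builds the AnnData object upstream.
adata = ad.read_h5ad("toy_dataset.h5ad")
scvi.model.MULTIVI.setup_anndata(adata, batch_key="dataset")
```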
Hi @bio-la, sorry you're running into this issue. It looks like your issue might be slightly different from the Discourse threads you linked, since the CI is running on Ubuntu, not macOS, so I'm guessing this is unrelated to the PyTorch MPS build.
Could you try passing in a lower learning rate (maybe lr=1e-5 or lr=1e-6) and see if that helps? Also, is this error occurring in the first epoch of training or later on? I wasn't able to find that info in the logs.
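For example (a sketch; `mvi` here stands for the MULTIVI model object your script constructs):

```python
# Same call as before, just with a much smaller learning rate than the
# default (1e-4), to rule out the loss blowing up early in training.
mvi.train(max_epochs=500, lr=1e-6)
```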
Thanks for your suggestions. Here are my comments:
- I get the same error message on Ubuntu and on a macOS machine with an Intel chip; if you suspect the cause on M3 chips is different, I can't explain why the effect is the same.
- The error appears in the first epoch of training.
- Changing lr from `1e-3` to `1e-5` doesn't solve the issue; scvi-tools 0.20.3 still works with the lower lr.
- I noticed that the conda install is now pulling scvi-tools 1.1.2, which still fails.
I'm using this dataset with the following parameters for MultiVI:
MultiVI:
  batch_covariate: dataset
  model_args:
    n_hidden: None
    n_latent: None
    # (bool, default: True)
    region_factors: True
    # {'normal', 'ln'} (default: 'normal')
    latent_distribution: 'normal'
    # (bool, default: False)
    deeply_inject_covariates: False
    # (bool, default: False)
    fully_paired: False
  training_args:
    # (default: 500)
    max_epochs: 500
    # float (default: 0.0001)
    lr: 1.0e-05
    # leave blank for default, str | int | bool | None (default: None)
    use_gpu:
    # float (default: 0.9)
    train_size: 0.9
    # leave blank for default, float | None (default: None)
    validation_size:
    # int (default: 128)
    batch_size: 128
    # float (default: 0.001)
    weight_decay: 0.001
    # float (default: 1.0e-08)
    eps: 1.0e-08
    # bool (default: True)
    early_stopping: True
    # bool (default: True)
    save_best: True
    # leave blank for default, int | None (default: None)
    check_val_every_n_epoch:
    # leave blank for default, int | None (default: None)
    n_steps_kl_warmup:
    # int | None (default: 50)
    n_epochs_kl_warmup: 50
    # bool (default: True)
    adversarial_mixing: True
    # leave blank for default, dict | None (default: None)
    training_plan: None
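For context, these settings end up being forwarded to scvi-tools roughly as follows (a sketch of the mapping, not the exact panpipes code; `adata`, `n_genes`, and `n_regions` are placeholders for objects the pipeline builds upstream):

```python
import scvi

mvi = scvi.model.MULTIVI(
    adata,                        # AnnData set up with batch_key="dataset"
    n_genes=n_genes,              # placeholders: counts come from the feature annotation
    n_regions=n_regions,
    region_factors=True,
    latent_distribution="normal",
    deeply_inject_covariates=False,
    fully_paired=False,
)
mvi.train(
    max_epochs=500,
    lr=1e-5,
    train_size=0.9,
    batch_size=128,
    weight_decay=1e-3,
    eps=1e-8,
    early_stopping=True,
    save_best=True,
    n_epochs_kl_warmup=50,
    adversarial_mixing=True,
)
```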
Hi @martinkim0, any news on this? Thanks!
We are on it. I have one solution (gradient clipping) that solves similar problems in totalVI (the AdversarialTrainingPlan is not stable). We first need to make sure that it doesn't reduce quality in downstream tasks. I would assume adversarial_mixing=False will solve it for now in your tests, but it will reduce integration. Last question: can you tell me the AnnData version used to save the object above? I had issues opening the file in my testing environment (likely an outdated AnnData on my end).
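For context, the change being tested is ordinary gradient clipping inside the training step; in plain PyTorch terms it looks roughly like this (a generic illustration with placeholder names, not the actual scvi-tools patch):

```python
import torch

def adversarial_step(model, batch, optimizer, max_grad_norm=1.0):
    """Generic manual-optimization step with gradient clipping (placeholder names)."""
    optimizer.zero_grad()
    loss = model(batch)  # placeholder forward pass returning a scalar loss
    loss.backward()
    # Clip gradient norms so a single bad batch cannot destabilize the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
    optimizer.step()
    return loss
```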
Hi @bio-la, currently we enable an adversarial classifier even if only a single batch is present in the dataset. This is a bug, and we will have a fix soonish. To work around the error for now, you can pass `mvi.train(adversarial_mixing=False)`. Please let us know if you are still facing the issue.
Awesome! I will wait for the changes to be merged and let you know. Thank you!
Fixed in #2914 and will be released with scvi-tools 1.2.
Hi @canergen! Thank you for providing the fix. I would like to use MultiVI for my project; do you have any indication of when scvi-tools 1.2 will be released?