OneTrainer [Bug]: RuntimeError: GET was unable to find an engine to execute this computation

Open amrakm opened this issue 9 months ago • 8 comments

What happened?

I had OneTrainer working fine before on the same machine.

I just pulled the latest changes and updated the dependencies with update.sh.

After the update I started getting this error: RuntimeError: GET was unable to find an engine to execute this computation

I also tried starting from scratch by cloning the repo again and doing a fresh install, and got the same error.

Config JSON (using the standard SDXL preset): config.json

What did you expect would happen?

Training starts.

Relevant log output

Traceback (most recent call last):
  File "/home/username/dataSSD/repos/OneTrainer/modules/ui/TrainUI.py", line 523, in __training_thread_function
    trainer.train()
  File "/home/username/dataSSD/repos/OneTrainer/modules/trainer/GenericTrainer.py", line 469, in train
    self.data_loader.get_data_set().start_next_epoch()
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/MGDS.py", line 50, in start_next_epoch
    self.loading_pipeline.start_next_epoch()
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/LoadingPipeline.py", line 86, in start_next_epoch
    module.start(self.__current_epoch)
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/pipelineModules/DiskCache.py", line 231, in start
    self.__refresh_cache(out_variation)
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/pipelineModules/DiskCache.py", line 204, in __refresh_cache
    f.result()
  File "/home/username/anaconda3/envs/ot/lib/python3.10/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/home/username/anaconda3/envs/ot/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/home/username/anaconda3/envs/ot/lib/python3.10/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/pipelineModules/DiskCache.py", line 192, in fn
    split_item[name] = self._get_previous_item(in_variation, name, in_index)
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/PipelineModule.py", line 87, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "/home/username/dataSSD/repos/OneTrainer/src/mgds/src/mgds/pipelineModules/EncodeVAE.py", line 46, in get_item
    latent_distribution = self.vae.encode(image.unsqueeze(0)).latent_dist
  File "/home/username/dataSSD/repos/OneTrainer/src/diffusers/src/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/username/dataSSD/repos/OneTrainer/src/diffusers/src/diffusers/models/autoencoders/autoencoder_kl.py", line 260, in encode
    h = self.encoder(x)
  File "/home/username/anaconda3/envs/ot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/username/anaconda3/envs/ot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/username/dataSSD/repos/OneTrainer/src/diffusers/src/diffusers/models/autoencoders/vae.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/home/username/anaconda3/envs/ot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/username/anaconda3/envs/ot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/username/anaconda3/envs/ot/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/username/anaconda3/envs/ot/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation
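The failing call is a plain `F.conv2d` inside the VAE encoder. To check whether it reproduces outside OneTrainer, a minimal sketch like the one below could be run in the same `ot` environment. It is not part of OneTrainer or diffusers, and the helper name is hypothetical. The 3→128 channel, 3×3, padding-1 convolution is only assumed to approximate the VAE's `conv_in` layer from the traceback.

```python
import importlib


def cudnn_conv_smoke_test(height: int = 64, width: int = 64, dtype_name: str = "float32") -> dict:
    """Report torch/CUDA/cuDNN versions and try one conv2d with cuDNN on and off.

    Hypothetical diagnostic helper; not part of OneTrainer or diffusers.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch": None}
    import torch.nn.functional as F

    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,         # CUDA version this torch build targets
        "cudnn": torch.backends.cudnn.version(),  # e.g. 8700 for cuDNN 8.7.0; None if unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if not report["cuda_available"]:
        return report

    dtype = getattr(torch, dtype_name)
    # Assumption: roughly the shape of the VAE encoder's conv_in (3 -> 128 channels, 3x3 kernel).
    x = torch.randn(1, 3, height, width, device="cuda", dtype=dtype)
    w = torch.randn(128, 3, 3, 3, device="cuda", dtype=dtype)
    for use_cudnn in (True, False):
        key = "conv2d_cudnn" if use_cudnn else "conv2d_no_cudnn"
        # cudnn.flags() temporarily overrides the torch.backends.cudnn settings inside the block.
        with torch.backends.cudnn.flags(enabled=use_cudnn):
            try:
                F.conv2d(x, w, padding=1)
                torch.cuda.synchronize()
                report[key] = "ok"
            except RuntimeError as exc:
                report[key] = str(exc)
    return report


if __name__ == "__main__":
    for name, value in cudnn_conv_smoke_test().items():
        print(f"{name}: {value}")
```

Suppose the cuDNN run fails with the same message while the non-cuDNN run succeeds. That would point at the torch/cuDNN installation rather than OneTrainer's own code. If the config runs the VAE in reduced precision, passing the matching `dtype_name` (for example `"float16"`) may be closer to the real call.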

Output of `pip freeze`

absl-py==2.1.0
accelerate==0.25.0
aiohttp==3.9.5
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.43.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
cloudpickle==3.0.0
coloredlogs==15.0.1
customtkinter==5.2.1
dadaptation==3.2
darkdetect==0.8.0
-e git+https://github.com/huggingface/diffusers.git@5d848ec07c2011d600ce5e5c1aa02a03152aea9b#egg=diffusers
filelock==3.14.0
flatbuffers==24.3.25
frozenlist==1.4.1
fsspec==2024.3.1
ftfy==6.2.0
google-auth==2.29.0
google-auth-oauthlib==1.2.0
grpcio==1.62.2
huggingface-hub==0.20.3
humanfriendly==10.0
idna==3.7
importlib_metadata==7.1.0
invisible-watermark==0.2.0
Jinja2==3.1.3
lightning-utilities==0.11.2
lion-pytorch==0.1.2
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@1dc300967e75b6fa0fb4b72587f3df08a8278efd#egg=mgds
mpmath==1.3.0
multidict==6.0.5
networkx==3.3
numpy==1.26.2
nvidia-cublas-cu11==11.11.3.6
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cudnn-cu11==8.7.0.84
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.3.0.86
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusparse-cu11==11.7.5.86
nvidia-nccl-cu11==2.19.3
nvidia-nvtx-cu11==11.8.86
oauthlib==3.2.2
omegaconf==2.3.0
onnxruntime-gpu==1.16.3
open-clip-torch==2.23.0
opencv-python==4.8.1.78
packaging==24.0
pillow==10.2.0
platformdirs==4.2.1
pooch==1.8.0
prodigyopt==1.0
protobuf==4.23.4
psutil==5.9.8
pyasn1==0.6.0
pyasn1_modules==0.4.0
Pygments==2.17.2
pynvml==11.5.0
pytorch-lightning==2.1.3
PyWavelets==1.6.0
PyYAML==6.0.1
regex==2024.4.28
requests==2.31.0
requests-oauthlib==2.0.0
rich==13.7.1
rsa==4.9
safetensors==0.4.1
scalene==1.5.39
scipy==1.11.4
sentencepiece==0.2.0
six==1.16.0
sympy==1.12
tensorboard==2.15.1
tensorboard-data-server==0.7.2
timm==0.9.16
tokenizers==0.15.2
torch==2.2.0+cu118
torchmetrics==1.3.2
torchvision==0.17.0+cu118
tqdm==4.66.1
transformers==4.36.2
triton==2.2.0
typing_extensions==4.11.0
urllib3==2.2.1
wcwidth==0.2.13
Werkzeug==3.0.2
xformers==0.0.24+cu118
yarl==1.9.4
zipp==3.18.1
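For an engine-selection error, the GPU-related pins are the relevant part of this freeze, for example when comparing a working and a broken environment. A small standard-library sketch can pull them out of pasted `pip freeze` text. The helper name is hypothetical, and the set of packages it keeps is an assumption about what matters here.

```python
import re


def gpu_stack_versions(freeze_text: str) -> dict:
    """Pick torch/torchvision/xformers and the nvidia-* wheel pins out of `pip freeze` text.

    Hypothetical helper; the filter below is an assumption about which pins are relevant.
    """
    pins = {}
    for token in freeze_text.split():
        name, sep, version = token.partition("==")
        if not sep:
            continue  # skips "-e" and editable git URLs such as "...#egg=mgds"
        pins[name.lower()] = version

    report = {
        name: version
        for name, version in pins.items()
        if name in ("torch", "torchvision", "xformers") or name.startswith("nvidia-")
    }
    # Local version tag of the torch wheel, e.g. "+cu118" -> "118"; None if absent.
    match = re.search(r"\+cu(\d+)$", pins.get("torch", ""))
    report["torch_cuda_tag"] = match.group(1) if match else None
    return report
```

Run on the freeze above, it would report torch `2.2.0+cu118` (CUDA tag `118`) alongside `nvidia-cudnn-cu11` `8.7.0.84`.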

Apr 30 '24 13:04 amrakm
