
(colab notebook) Train DALLE-pytorch on C@H

Open · afiaka87 opened this issue 3 years ago · 11 comments

https://gist.github.com/afiaka87/b29213684a1dd633df20cab49d05209d

If there are any bugs, please leave a comment below. When in doubt, restart your kernel; that tends to fix a lot of things.

afiaka87 avatar Jun 08 '21 23:06 afiaka87

Hi, I messaged you on Discord but you seemed to be busy. Anyway, I have a problem where it gets stuck at 'Time to load sparse_attn op:' no matter what params I use. It used to work; now it takes 10+ minutes. Is this a bug or a simple mistake on my part?

johngore123 avatar Jun 09 '21 20:06 johngore123

And btw I'm valteralfred. @afiaka87

johngore123 avatar Jun 09 '21 20:06 johngore123

Hey! I think I've seen this bug before. You need to delete the folder containing the precompiled PyTorch extensions. I want to say it's the /root/.cache/torch_extensions directory, but I'm on mobile and can't check right now.
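
Something like this in a notebook cell should clear it (a sketch; the exact path can vary by PyTorch version, and the TORCH_EXTENSIONS_DIR environment variable overrides the default):

# Sketch: remove the precompiled extension cache so DeepSpeed / PyTorch
# rebuilds its ops from scratch on the next run. On Colab the default is
# usually /root/.cache/torch_extensions, but TORCH_EXTENSIONS_DIR wins if set.
import os
import shutil

cache_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)

if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Removed {cache_dir}; extensions will recompile on next use.")
else:
    print(f"No extension cache found at {cache_dir}")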

afiaka87 avatar Jun 09 '21 20:06 afiaka87

Thanks, I'll try that. If it doesn't work I'll try something else.

johngore123 avatar Jun 09 '21 20:06 johngore123

It seems to have fixed itself! Thanks for the help.

johngore123 avatar Jun 09 '21 21:06 johngore123

I think the cache got cleaned

johngore123 avatar Jun 09 '21 21:06 johngore123

Anyone coming here from the notebook: I'm not on the Discord as often as I should be. Please file notebook issues here if you can; otherwise I'm not as likely to see them.

I believe the issue here is that PyTorch or DeepSpeed gets stuck trying to compile an extension. When in doubt, restart the kernel on your notebook. You won't lose your instance; it just clears any local state you have. Then you can re-run the cell you were on; no need to re-run the setup cells.
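
If it keeps hanging after a restart, one quick check (a sketch, assuming DeepSpeed installed its usual ds_report diagnostic script) is to see which ops DeepSpeed thinks it can build on the current instance:

# Sketch: print DeepSpeed's op compatibility report from a notebook cell.
# ds_report ships with DeepSpeed and lists, per op (including sparse_attn),
# whether it is pre-installed or can be JIT-compiled on this machine.
import subprocess

report = subprocess.run(["ds_report"], capture_output=True, text=True)
print(report.stdout)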

afiaka87 avatar Jun 11 '21 21:06 afiaka87

Hi there,

Trying the Colab notebook for the first time. It gets stuck at the installation of NVIDIA apex. The --disable-pip-version-check flag doesn't seem to help?

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 171, in <module> check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal  "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 10.2.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = 
'"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"';f = getattr(tokenize, 
'"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = 
f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record 
/tmp/pip-record-q6hys36y/install-record.txt --single-version-externally-managed --compile --install-headers 
/usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/req_install.py", line 825, in install
    req_description=str(self.req),
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/install/legacy.py", line 81, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure
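
For anyone else who lands here: a quick way to confirm it really is the version mismatch apex's setup.py is complaining about (just a sketch; the versions depend on the current Colab image) is to compare the CUDA build of the installed torch wheel against the local nvcc toolkit:

# Sketch: the apex build check above fails when these two CUDA versions
# disagree. Print both so you can see the mismatch before reinstalling
# anything.
import subprocess
import torch

print("PyTorch was built against CUDA:", torch.version.cuda)

# nvcc reports the toolkit that apex's setup.py compiles against.
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(nvcc.stdout)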

SadRebel1000 avatar Jul 23 '21 10:07 SadRebel1000

I updated the Colab notebook recently to train with the Crawling@Home dataset. Hopefully that fixed some of these issues.

afiaka87 avatar Sep 21 '21 19:09 afiaka87

@afiaka87 Hi, thanks for sharing. I am using the afiaka DALLE generation Colab: https://colab.research.google.com/drive/11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing#scrollTo=682c5804-5f97-469f-8cf1-1cc8356591b8. I hit an error I don't know how to fix:

File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 127, in forward
    assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support
Finished generating images, attempting to display results...

I also found a related issue here: https://github.com/robvanvolt/DALLE-models/issues/13 but no one has fixed it yet.

Stomachache007 avatar Dec 17 '21 06:12 Stomachache007

This has to do with DeepSpeed dropping support for a lot of GPUs in its sparse attention CUDA code. Regrettably, I don't believe the sparse attention ops are likely to work again soon, as I can no longer run them locally either.
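
If you want to see whether your runtime is even a candidate, a rough sanity check (a sketch; the actual compatibility matrix is DeepSpeed's) is to print the GPU and try importing the sparse attention module:

# Sketch: report the GPU and whether DeepSpeed's sparse attention module
# imports on this runtime. Importing won't exercise the CUDA kernels, but
# it catches the obvious failures (missing triton, unsupported setup).
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("PyTorch CUDA build:", torch.version.cuda)
else:
    print("No GPU visible to PyTorch.")

try:
    from deepspeed.ops.sparse_attention import SparseSelfAttention  # noqa: F401
    print("deepspeed.ops.sparse_attention imported OK")
except Exception as exc:
    print("Sparse attention unavailable here:", exc)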

afiaka87 avatar Dec 17 '21 10:12 afiaka87