DALLE-pytorch
(colab notebook) Train DALLE-pytorch on C@H
https://gist.github.com/afiaka87/b29213684a1dd633df20cab49d05209d
If there are any bugs, please leave a comment below. When in doubt, restart your kernel; that tends to fix things.
Hi, I messaged you on Discord but you seemed to be busy. Anyway, I have a problem where it gets stuck at 'Time to load sparse_attn op:' no matter what params I use. It used to work; now it takes 10+ minutes. Is this a bug or a simple mistake on my part?
And btw, I'm valteralfred. @afiaka87
Hey! I've seen this bug before, I think. You need to delete the folder containing the precompiled PyTorch extensions. I want to say it's the /root/.cache/torch_extensions directory, but I'm on mobile and can't check currently.
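If you want to try that, here is a sketch of clearing the cache from a notebook cell. The default path is an assumption and may differ per user or PyTorch version; PyTorch also honors the TORCH_EXTENSIONS_DIR environment variable when set:

```shell
# Remove the precompiled PyTorch extension cache so ops like sparse_attn
# get rebuilt cleanly on the next import. Path is an assumption; adjust
# if your environment puts the cache elsewhere.
CACHE_DIR="${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}"
rm -rf "$CACHE_DIR"
```

After clearing it, restart the kernel so the extension is recompiled from scratch.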
Thanks, I'll try that. If it doesn't work, I'll try something else.
It seems to have fixed itself! Thanks for the help.
I think the cache got cleaned.
Anyone coming here from the notebook: I'm not on the Discord as often as I should be. File issues with the notebook here if you can; otherwise I'm less likely to see them.
I believe the issue here is that PyTorch or DeepSpeed gets stuck trying to compile an extension. When in doubt, restart the kernel on your notebook. You won't lose your instance; it just clears your current local state. Then you can re-run the cell you were on; no need to re-run the setup cells.
Hi there,
Trying the colab notebook for the first time. It gets stuck at the installation of NVIDIA Apex. It seems that the --disable-pip-version-check flag doesn't work?
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 171, in <module> check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "/tmp/pip-req-build-ikzk4nf6/setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 10.2.
In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] =
'"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ikzk4nf6/setup.py'"'"';f = getattr(tokenize,
'"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code =
f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record
/tmp/pip-record-q6hys36y/install-record.txt --single-version-externally-managed --compile --install-headers
/usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/pip/_internal/req/req_install.py", line 825, in install
req_description=str(self.req),
File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/install/legacy.py", line 81, in install
raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure
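For anyone hitting this Apex error: the check that fails compares the CUDA version PyTorch was built against (torch.version.cuda) with the nvcc version installed on the machine, and Apex refuses to build when they differ. A rough sketch of that comparison (the real check lives in apex's setup.py; this is just an illustration of the logic):

```python
import re

def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a CUDA version string like '10.2' or '11.1.105'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unparseable CUDA version: {version!r}")
    return int(m.group(1)), int(m.group(2))

def cuda_versions_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    """Apex aborts the build when the major.minor versions differ,
    e.g. torch built with CUDA 10.2 vs a newer nvcc on Colab."""
    return parse_major_minor(torch_cuda) == parse_major_minor(nvcc_cuda)

# e.g. compare torch.version.cuda against the version `nvcc --version` reports
print(cuda_versions_match("10.2", "10.2.89"))  # True: minor patch levels are ignored
print(cuda_versions_match("10.2", "11.0"))     # False: this is what raises the RuntimeError
```

So the fix is usually to install a torch build matching the system's CUDA toolkit (or vice versa), rather than fighting the pip flags.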
I updated the colab notebook recently to train with the Crawling@Home dataset. Hopefully that fixed some of these issues.
@afiaka87 Hi, thanks for sharing. I am using the afiaka DALL-E generation colab: https://colab.research.google.com/drive/11V2xw1eLPfZvzW8UQyTUhqCEU71w6Pr4?usp=sharing#scrollTo=682c5804-5f97-469f-8cf1-1cc8356591b8. I got a bug I don't know how to fix:

File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 127, in forward
assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support

Finished generating images, attempting to display results...
I also found a related issue here: https://github.com/robvanvolt/DALLE-models/issues/13, but no one has fixed it yet.
This has to do with DeepSpeed dropping support for a lot of GPUs in its sparse attention CUDA code. Regrettably, I don't believe it's likely to work again soon, as I can no longer run it locally either.
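For what it's worth, the assertion itself only complains that the queries are fp32, so on a GPU that DeepSpeed still supports, one workaround people try is casting the model and any floating-point inputs to half precision before generating. A minimal sketch (the `to_half` helper and the usage names are hypothetical, not part of DALLE-pytorch):

```python
import torch

def to_half(model: torch.nn.Module, tokens: torch.Tensor):
    """Cast model weights and floating-point inputs to fp16 so that
    DeepSpeed's sparse attention check (query.dtype == torch.half) passes.
    Integer token ids are left untouched; only float tensors are cast."""
    model = model.half()
    if tokens.is_floating_point():
        tokens = tokens.half()
    return model, tokens

# hypothetical usage in the generation notebook:
# dalle, text = to_half(dalle, text)
# images = dalle.generate_images(text)
```

This only helps when the underlying sparse attention kernels actually support your GPU; if they don't, the op fails earlier at compile/load time regardless of dtype.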