
Reproducing DALL-E using DeepSpeed

mehdidc opened this issue 3 years ago • 32 comments

Hi @lucidrains, Hi @robvanvolt,

@JeniaJitsev initially started a discussion in the Discord channel of @robvanvolt. Just a brief recap: we (@JeniaJitsev, @janEbert, and myself) are in a research group in Germany, Helmholtz AI, which is part of the Helmholtz Association. We are interested in reproducing DALL-E. We can offer you access to A100 GPUs (from https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) for reproducing the model, ideally using DeepSpeed for distributed training. What are your thoughts? Would you be interested?

mehdidc avatar Mar 29 '21 17:03 mehdidc

@mehdidc Hi Mehdi! I am actually busy with protein folding replication (Alphafold2), but I think @robvanvolt and @afiaka87 would definitely love to make use of the resources :) Thank you!

lucidrains avatar Mar 29 '21 17:03 lucidrains

> @mehdidc Hi Mehdi! I am actually busy with protein folding replication (Alphafold2), but I think @robvanvolt and @afiaka87 would definitely love to make use of the resources :) Thank you!

@lucidrains Just for context, I referred them to you due to my inability to answer questions regarding multi-GPU compute.

@mehdidc Seems we're all a bit busy at the moment. I will do my best to help you with this if you can file issues for us, but I've decided to be fairly hands-off in the Discord chat for personal reasons.

afiaka87 avatar Mar 29 '21 18:03 afiaka87

@lucidrains I know you're busy, but a quick yes or no will suffice:

Does the codebase in its current form make use of multiple GPUs?

afiaka87 avatar Mar 29 '21 18:03 afiaka87

@mehdidc Just to be clear - we are quite interested.

I'll be making this a high priority but can only help so much due to my lack of machine-learning knowledge. I'm assuming robvanvolt feels similarly, but they are also dealing with quite a surge in traffic on the newly created Discord.

If you have a bit of patience though, we'll both be able to help you out along the process.

afiaka87 avatar Mar 29 '21 18:03 afiaka87

Ohhh right, so the current script does not do multi-GPU, but it should be pretty easy to get multi-GPU working with the newest DeepSpeed (or PyTorch Lightning). I'll see what I can do tomorrow.
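For illustration, a rough sketch of what that could look like with plain DeepSpeed data parallelism, using DALLE-pytorch's public API. The tiny model sizes and config values are placeholders rather than recommendations, and the exact keyword for passing a config dict to deepspeed.initialize differs between DeepSpeed versions:

```python
# Launch with the DeepSpeed launcher, e.g.:  deepspeed train_sketch.py
import torch
import deepspeed
from dalle_pytorch import DiscreteVAE, DALLE

# toy-sized VAE + DALL-E purely for illustration
vae = DiscreteVAE(image_size=64, num_layers=2, num_tokens=512,
                  codebook_dim=256, hidden_dim=32)
dalle = DALLE(dim=256, vae=vae, num_text_tokens=10000,
              text_seq_len=128, depth=2, heads=4)

ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 3e-4}},
}

# DeepSpeed wraps the model in a data-parallel engine and builds the optimizer
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=dalle,
    model_parameters=dalle.parameters(),
    config=ds_config,  # older releases take config_params= instead
)

text = torch.randint(0, 10000, (4, 128)).to(model_engine.device)
images = torch.randn(4, 3, 64, 64).to(model_engine.device)

loss = model_engine(text, images, return_loss=True)
model_engine.backward(loss)  # replaces loss.backward()
model_engine.step()          # replaces optimizer.step() / zero_grad()
```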

lucidrains avatar Mar 29 '21 18:03 lucidrains

Hey folks! So there will be no trouble arranging access to compute resources on the order of 2 compute nodes with 4 GPUs each, given that we have a look together into multi-GPU execution, preferably using DeepSpeed (as it seems to me the most straightforward way with transformers right now), but we are open to other suggestions. @lucidrains I can imagine that, starting from that, we can also move toward working together on AlphaFold2 as well, at least with regard to its Transformer component. It may well turn into a generic collaboration on distributed training of various useful architectures on multi-node, multi-GPU systems.

Please let us know what you think.

JeniaJitsev avatar Mar 29 '21 18:03 JeniaJitsev

> Hey folks! So there will be no trouble arranging access to compute resources on the order of 2 compute nodes with 4 GPUs each, given that we have a look together into multi-GPU execution, preferably using DeepSpeed (as it seems to me the most straightforward way with transformers right now), but we are open to other suggestions. @lucidrains I can imagine that, starting from that, we can also move toward working together on AlphaFold2 as well, at least with regard to its Transformer component. It may well turn into a generic collaboration on distributed training of various useful architectures on multi-node, multi-GPU systems.
>
> Please let us know what you think.

My first concern is with regard to DeepSpeed. I've not yet been able to get it working (independently) with the sparse attention that we use. Is this something you've dealt with?

I believe lucidrains has gotten it working, because there's an install script in the repo and code for it. But as it stands we don't have a pleasant Docker-deploy-type scenario (and those scripts don't seem to work on my configs even if I switch to the correct cudatoolkit, etc.).

Furthermore - I'm not certain that Microsoft actually supports the A100 GPU yet. For now it seems your best bet is to deploy using a V100 or a Titan RTX. I've filed an issue about this here. Give it a thumbs up and maybe they'll have a look? Not likely though. That's not to say that it won't work - but it may require severe tinkering.

afiaka87 avatar Mar 29 '21 18:03 afiaka87

Good question - so first we have to clarify whether we can get the sparse-attention transformer and DeepSpeed to work together. We ourselves haven't tried it - in fact, we have only run DeepSpeed in a very simple multi-GPU scenario, data-parallel mode, for standard CIFAR-10 supervised training on ResNet-50, so quite boring.

How about this: I will provide you with links and instructions on how to register at our supercomputing facilities and will grant you access to some compute resources. We can then try together to run this one particular test: DeepSpeed with the sparse-attention transformer.
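A minimal smoke test for that particular combination might look roughly like the sketch below, which exercises DeepSpeed's sparse-attention module directly. It assumes the kernels build on the node at all (which needs Triton and a supported GPU/CUDA combination), and all parameters are purely illustrative:

```python
# Sketch: check that DeepSpeed's block-sparse attention kernels run on this node.
import torch
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

cfg = FixedSparsityConfig(num_heads=8, attention="unidirectional")
attn = SparseSelfAttention(sparsity_config=cfg, max_seq_length=1024).cuda()

# inputs are (batch, heads, seq_len, head_dim); the kernels expect fp16 on GPU
# and a sequence length that is a multiple of the sparsity block size
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.half)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)
print("sparse attention OK:", out.shape)
```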

Time-wise there is no hurry. In fact, we are unfortunately also quite busy and until the end of May will have only a sparse way )) of doing hands-on work with you. From June on it looks better. But we can manage the first steps, so that you have your environment on the compute nodes and the libraries in place, etc. One note - on supercomputing nodes it is not really possible to flexibly switch low-level things like NVIDIA drivers or to switch between a lot of different CUDA versions, etc., if that becomes necessary.

JeniaJitsev avatar Mar 29 '21 18:03 JeniaJitsev

> Furthermore - I'm not certain that Microsoft actually supports the A100 GPU yet. For now it seems your best bet is to deploy using a V100 or a Titan RTX. I've filed an issue about this here. Give it a thumbs up and maybe they'll have a look? Not likely though. That's not to say that it won't work - but it may require severe tinkering.

It is not a problem to start with V100, we have nodes with those as well.

JeniaJitsev avatar Mar 29 '21 18:03 JeniaJitsev

Hm - well, if you're not in desperate need of the actual sparse attention, then as far as I'm concerned, turn it off the moment it gives you problems, ha.

And yeah, I believe the V100s would be a better starting point just to get the code running at least. Do any of you have local dev environments with GPUs you can use as well, without needing to explicitly include them in your budget?

afiaka87 avatar Mar 29 '21 18:03 afiaka87

We do have a machine with 4x V100 without budget limitation, with the drawback that it is not accessible from outside. I think it would be better to get onto a machine where we can all indeed work together. Let's try to have a model training running on a compute node where we all have access. Once we have it tuned, we can commit longer training runs on the local machine for further testing.

JeniaJitsev avatar Mar 29 '21 18:03 JeniaJitsev

@lucidrains @afiaka87 Let's take a step like this: please drop me a short email at [email protected], and I will send you instructions so that you can already register for access and I can add you both to the compute project. We do this step and see from then on how to organize ourselves.

JeniaJitsev avatar Mar 29 '21 18:03 JeniaJitsev

With regard to sparse attention: another colleague of ours, Alex Strube (@surak), opened an issue at DeepSpeed - judging from that discussion, it should be fine to go with V100 and CUDA 11: https://github.com/microsoft/DeepSpeed/issues/790

JeniaJitsev avatar Mar 29 '21 18:03 JeniaJitsev

@lucidrains If it's fine with you, I'd take the learning experience with DeepSpeed and try to get it running on some V100s tomorrow. Please tell me if you'd rather do it yourself; otherwise, I'm definitely up for relieving you of that.

janEbert avatar Mar 29 '21 18:03 janEbert

@janEbert would be pleased for you to take the helm!

lucidrains avatar Mar 29 '21 19:03 lucidrains

Thanks for the trust. ;)

janEbert avatar Mar 29 '21 19:03 janEbert

Awesome! This got a little traction fast! :D I'm currently trying to get DeepSpeed with sparse attention running on an RTX 3090 (if that succeeds, it should work on the A100 as well).

@afiaka87 is right, I'm rather new to ML - just a programmer for a little more than a decade - so I wouldn't be of much help in the deep darks of ML outside of a little code optimization / preprocessing and translation of captions / organizing stuff (that was the reason for the Discord: a more organized crew and less "chat" here in the GitHub issues).

robvanvolt avatar Mar 29 '21 19:03 robvanvolt

> @lucidrains If it's fine with you, I'd take the learning experience with DeepSpeed and try to get it running on some V100s tomorrow. Please tell me if you'd rather do it yourself; otherwise, I'm definitely up for relieving you of that.

@janEbert Thanks a ton for taking this up! Your prior experience means you're likely to figure it out a bit faster than I could have. I'm still happy to help - I don't intend to get in your way, of course. @JeniaJitsev I agree that a team environment is going to work well here! Thank you very much. At most I'll be keeping an eye on janEbert's progress, but I am also interested in access and may be able to log in and fix things occasionally going forward. I'll send you an email now and we can discuss it there.

afiaka87 avatar Mar 29 '21 19:03 afiaka87

Ah right, you also mentioned you wanted to do it, sorry! I'll see how far I can get tomorrow and stay in touch with you on the Discord, is that okay?

janEbert avatar Mar 29 '21 19:03 janEbert

> Ah right, you also mentioned you wanted to do it, sorry! I'll see how far I can get tomorrow and stay in touch with you on the Discord, is that okay?

Please do! I'll be highly available to help if you need anything.

afiaka87 avatar Mar 29 '21 19:03 afiaka87

That's great to know, thank you!

janEbert avatar Mar 29 '21 19:03 janEbert

@janEbert @JeniaJitsev This system is indeed complex. Could I borrow one of you for a quick tutorial on deploying dalle-pytorch with a proper dataset? I believe that would speed up things a bit for me.

afiaka87 avatar Mar 31 '21 00:03 afiaka87

> @janEbert @JeniaJitsev This system is indeed complex. Could I borrow one of you for a quick tutorial on deploying dalle-pytorch with a proper dataset? I believe that would speed up things a bit for me.

You can also borrow @mehdidc, who opened this issue - he will also be eager to help, I guess ))

JeniaJitsev avatar Mar 31 '21 08:03 JeniaJitsev

Great work so far! I just want to throw it out there that PyTorch Lightning has DeepSpeed (and wandb) integration: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html?highlight=deepspeed

Perhaps by using it we can get the best of both worlds and keep things significantly less complex than they otherwise would be?
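For what it's worth, the Lightning route might look roughly like the sketch below. The small LightningModule wrapper and toy data are made up for illustration, and the way DeepSpeed is selected on the Trainer has changed across Lightning releases (newer versions use a strategy argument, while around the time of this thread it was plugins="deepspeed"):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from dalle_pytorch import DiscreteVAE, DALLE

class DalleLightning(pl.LightningModule):
    """Hypothetical minimal wrapper around DALLE, for illustration only."""
    def __init__(self):
        super().__init__()
        vae = DiscreteVAE(image_size=64, num_layers=2, num_tokens=512,
                          codebook_dim=256, hidden_dim=32)
        self.dalle = DALLE(dim=256, vae=vae, num_text_tokens=10000,
                           text_seq_len=128, depth=2, heads=4)

    def training_step(self, batch, batch_idx):
        text, images = batch
        return self.dalle(text, images, return_loss=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=3e-4)

# toy random data so the sketch is self-contained
data = TensorDataset(torch.randint(0, 10000, (8, 128)),
                     torch.randn(8, 3, 64, 64))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="deepspeed_stage_2",  # older Lightning: plugins="deepspeed"
)
trainer.fit(DalleLightning(), DataLoader(data, batch_size=2))
```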

lucidrains avatar Mar 31 '21 14:03 lucidrains

Thanks for the suggestion, I didn't know about that! To the software engineer in me, it's valuable to have direct access to the API I'm using. For what it's worth, if I understood the documentation correctly, training is now set up so we can do anything PyTorch Lightning can (using ZeRO or ZeRO-Offload). However, we could even use DeepSpeed's pipeline parallelism if we wrap our models, which Lightning does not support yet.
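For reference, enabling ZeRO stage 2 with optimizer offload to CPU comes down to the zero_optimization section of the DeepSpeed config. The numbers below are illustrative, and older DeepSpeed releases spelled the offload option as a plain "cpu_offload" flag:

```python
# illustrative DeepSpeed config fragment for ZeRO stage 2 + ZeRO-Offload
ds_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # older releases: "cpu_offload": true
    },
}
```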

I definitely see the value in clean (and, even more importantly, battle-tested) research code. For now, we could clean up the training code by wrapping even the non-distributed models, so we don't need to handle the distributed and non-distributed update-step code in different ways. This may cause other issues for users, though. It's a tough call. :/

janEbert avatar Mar 31 '21 15:03 janEbert

Well, as expected, I was completely wrong... :D I didn't manage to get ZeRO to work with the current setup, for example. It seems more changes have to be made.

janEbert avatar Mar 31 '21 16:03 janEbert

@janEbert no worries, give it another try :) I believe in you

lucidrains avatar Mar 31 '21 16:03 lucidrains

@janEbert, @mehdidc & everyone:

It seems the EleutherAI folks are working on training and then releasing a publicly available large GPT version (a 175B one), and they also use a codebase that employs DeepSpeed. It looks to me like it could be helpful for the DeepSpeed experiments we conduct. They also have their own fork of DeepSpeed adapted to this end.

  • GPT‑NeoX is an implementation of 3D-parallel GPT‑3-like models on distributed GPUs, based upon DeepSpeed and Megatron-LM. It is designed to be able to train models in the hundreds of billions of parameters or larger (https://www.eleuther.ai/projects/gpt-neox/)
  • As of 2021-03-31, the codebase is fairly stable. DeepSpeed, 3D-parallelism and ZeRO are all working properly. [it seems only ZeRO stage 1 is working; ZeRO 3 is in progress]
  • [still no surprise on the sparse attention situation here] DeepSpeed's sparse attention kernels are supported, but don't work with CUDA 11.0+ and require a specific hardware setup (V100s/RTX 2080s). Add "sparsity": "all" to your config to use sparse attention on all layers, or "sparsity": "interspersed" to use it every other layer.
  • https://github.com/EleutherAI/gpt-neox
  • https://github.com/EleutherAI/DeeperSpeed

JeniaJitsev avatar Apr 02 '21 10:04 JeniaJitsev

Well, we are lucky enough to use the DeepSpeed library itself, so we have stage 2 working already! I can't test stage 3 as I don't have access to a recent enough version of DeepSpeed, but based on my assumptions it really should work out of the box with the current code.

janEbert avatar Apr 02 '21 11:04 janEbert

> Well, we are lucky enough to use the DeepSpeed library itself, so we have stage 2 working already! I can't test stage 3 as I don't have access to a recent enough version of DeepSpeed, but based on my assumptions it really should work out of the box with the current code.

Okay, if it works out of the box with DeepSpeed, all the better. Fewer libraries, less trouble ))

JeniaJitsev avatar Apr 02 '21 12:04 JeniaJitsev