
CI / Tests for DALLE-pytorch

Open robvanvolt opened this issue 3 years ago • 16 comments

As the number of papers (and pull requests) increases, testing a branch before merging it to master becomes more and more important.

@afiaka87 and @rom1504 had the idea to include DALLE-pytorch in a continuous integration cycle, or at least to run a few tests before merging a branch, to ensure no essential features get broken.

A simple, first test could look as follows:

Run the following commands (each with text-image folders and a WebDataset stream, as well as with both the OpenAI and Taming Transformers VAEs):

  • python...
  • deepspeed...
  • deepspeed fp16
  • deepspeed apex

The tests can be sped up with

itertools.islice(dataloader, 10) if args.test else dataloader

so the tests only run on a few samples.
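
A minimal sketch of that gating (assuming a --test CLI flag wired through to args.test; the names here are illustrative, not existing code):

from itertools import islice

def limit_for_test(dataloader, test_mode, n_batches=10):
    # In test mode, yield only the first n_batches batches;
    # otherwise pass the dataloader through unchanged.
    return islice(dataloader, n_batches) if test_mode else dataloader

# Hypothetical usage inside the training loop of train_dalle.py:
# for text, images in limit_for_test(dl, args.test):
#     ...  # usual training step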

robvanvolt avatar Jun 13 '21 09:06 robvanvolt

Nice, I was just working on a less-than-automated version of this that's at least possible currently:

https://colab.research.google.com/gist/afiaka87/bbac0038b213a82c067b4766cdf45e0d/ci_train_dalle_pytorch_wip.ipynb

We just need to ask folks to run that notebook (just once) with their branch name and repo URL. If they can't, they're free to ping me and I'll do it as soon as possible. I'm sure others can pitch in on the Discord as well.

Obviously not "CI" per se - but a better feedback loop than we have currently.

afiaka87 avatar Jun 13 '21 09:06 afiaka87

Feel free to hack at it as well - I just converted a notebook that's meant to be a bit more accessible.

The idea, though, is that this should be a way to run with plain Python, then sparse attention (no hard requirement on passing there due to Colab issues), then DeepSpeed, then DeepSpeed with Apex AMP, then DeepSpeed ZeRO stage 1 fp16, stage 2 fp16, stage 3 fp16, etc.

Have you used DeepSpeed's profiler feature? It gives accurate timings of how many milliseconds each part of your architecture takes during both the forward and the backward pass. It also conveniently counts parameters per module and shows which parts of the architecture are responsible for the size of your model.

Anyway, you can just enable it for each run and specify the profile step, and it will quit out after that step and give you this nice description as well.
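
For reference, DeepSpeed's flops profiler is configured through a flops_profiler section of the DeepSpeed config; a sketch of that section (keys as documented by DeepSpeed, values picked arbitrarily), written as the Python dict that would be serialized into the JSON config:

# Sketch only: profile_step and top_modules values are arbitrary choices.
flops_profiler_section = {
    "flops_profiler": {
        "enabled": True,      # switch the profiler on
        "profile_step": 200,  # profile and print the summary at this training step
        "module_depth": -1,   # -1 = report every module depth
        "top_modules": 3,     # number of top modules listed per depth
        "detailed": True,     # include the per-module breakdown
    }
}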

afiaka87 avatar Jun 13 '21 09:06 afiaka87

The profiler feature would need a PR to implement, and all this DeepSpeed custom stuff is getting out of hand. They support JSON config files; we should as well, I think? I don't know if we're overriding it or not, but you can pass a JSON file to --deepspeed_config as long as you're running with the deepspeed command.
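
For illustration, a config file could be generated and passed like this (a sketch; the values are placeholders, and whether train_dalle.py currently overrides them internally is exactly the open question above):

import json

# Minimal DeepSpeed config with placeholder values.
ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": False},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then launch through the deepspeed command, roughly:
#   deepspeed train_dalle.py --deepspeed_config ds_config.json ...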

afiaka87 avatar Jun 13 '21 10:06 afiaka87

I went ahead and made a PR which adds the --flops_profiler flag. It will run normally for 200 steps and then on the 200th step enable a very precise timer and give you super detailed layouts of your model like this.

This is super handy for identifying bottlenecks in the architecture and benchmarking the various types of attention we have at our disposal.

https://github.com/lucidrains/DALLE-pytorch/pull/302

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size (mp_size),
number of parameters (params), number of multiply-accumulate operations (MACs),
number of floating point operations (flops), floating point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   1
data parallel size:                                           1
model parallel size:                                          1
batch size per GPU:                                           80
params per gpu:                                               336.23 M
params of model = params per GPU * mp_size:                   336.23 M
fwd MACs per GPU:                                             3139.93 G
fwd flops per GPU = 2 * fwd MACs per GPU:                     6279.86 G
fwd flops of model = fwd flops per GPU * mp_size:             6279.86 G
fwd latency:                                                  76.67 ms
bwd latency:                                                  108.02 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):   102.0 TFLOPS
step latency:                                                 34.09 us
iter latency:                                                 184.73 ms
samples/second:                                               433.07

----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'BertForPreTrainingPreLN': '336.23 M'}
    MACs        - {'BertForPreTrainingPreLN': '3139.93 GMACs'}
    fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'}
depth 1:
    params      - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'}
    MACs        - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'}
    fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'}
depth 2:
    params      - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'}
    MACs        - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'}
    fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'}
depth 3:
    params      - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'}
    MACs        - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'}
    fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms'}
depth 4:
    params      - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M'}
    MACs        - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'}
    fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'}
depth 5:
    params      - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'}
    MACs        - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'}
    fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'}
depth 6:
    params      - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'}
    MACs        - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'}
    fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

BertForPreTrainingPreLN(
  336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS,
  (bert): BertModel(
    335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS,
    (embeddings): BertEmbeddings(...)
    (encoder): BertEncoder(
      302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS,
      (FinalLayerNorm): FusedLayerNorm(...)
      (layer): ModuleList(
        302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS,
        (0): BertLayer(
          12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS,
          (attention): BertAttention(
            4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS,
            (self): BertSelfAttention(
              3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS,
              (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...)
              (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...)
              (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...)
              (dropout): Dropout(...)
              (softmax): Softmax(...)
            )
            (output): BertSelfOutput(
              1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS,
              (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...)
              (dropout): Dropout(...)
            )
          )
          (PreAttentionLayerNorm): FusedLayerNorm(...)
          (PostAttentionLayerNorm): FusedLayerNorm(...)
          (intermediate): BertIntermediate(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS,
            (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...)
          )
          (output): BertOutput(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS,
            (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...)
            (dropout): Dropout(...)
          )
        )
        ...
        (23): BertLayer(...)
      )
    )
    (pooler): BertPooler(...)
  )
  (cls): BertPreTrainingHeads(...)
)
------------------------------------------------------------------------------

afiaka87 avatar Jun 13 '21 12:06 afiaka87

This is really cool! Also the Colab (see the WDS implementation; I tested it in your Colab: https://github.com/lucidrains/DALLE-pytorch/pull/280#issuecomment-860207682).

But I think it might get a little flaky if we test only using Colab - a few simple automatic test steps could still be useful, where you do not have to rerun the Colab every few seconds and change the parameters manually. :)

robvanvolt avatar Jun 13 '21 13:06 robvanvolt

Yeah, sorry I didn't make it clearer; I think for now at least just running training is way more than we've been doing. The notebook was originally designed with so many fields because it was for the public. Definitely don't expect anyone to be running the notebook more than once.

Hopefully we can keep adding important edge cases as we go? I guess the main one is "does DeepSpeed work"; but you're right - testing DeepSpeed on Colab is sort of contrived and doesn't help much. I guess this should be more of a health-check/integration test, just so that people can find bugs themselves a bit more easily.

Also - lucidrains seems to be super busy lately, so if one of us can say "verified this runs in the notebook" real quick, I hope it can act as a mild vetting process for contributions.

afiaka87 avatar Jun 13 '21 20:06 afiaka87

Do you have anything in mind other than Colab that can be spun up that easily? @rom1504 mentioned running like 10 steps on the CPU, but if that winds up being "a whole thing", then I probably won't spend too much time on it. I've honestly never tried it with DALLE (or anything, really) - but I'm guessing it doesn't work as it stands?

afiaka87 avatar Jun 13 '21 20:06 afiaka87

Sure, a simple test.py could be used (with a sample data folder and a URL to a small .tar WebDataset - or instead use islice(dl, 10) on the dataloader if --test is provided as an argument, to run for only 10 steps), like this:

import subprocess
import time

test_dict = {
    'test_dalle_raw': "python train_dalle.py ...",
    'test_deepspeed': "deepspeed train_dalle.py ...",
    'test_deepspeed_fp16': "...",
    'test_deepspeed_apex': "...",
    'test_deepspeed_wds': "...",
    # and so on
}

runs = []

for key, command in test_dict.items():
    print('Running test {}'.format(key))
    print('Command: {}'.format(command))
    start = time.perf_counter()
    try:
        # Run the training command as a subprocess; raise if it exits non-zero.
        subprocess.run(command, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        runs.append([key, 'Failed', time.perf_counter() - start])
        print(e)
    else:
        elapsed = time.perf_counter() - start
        print('Successfully ran {} test in {:.1f} seconds.'.format(key, elapsed))
        runs.append([key, 'Successful', elapsed])

if all(run[1] == 'Successful' for run in runs):
    print('Successfully ran all tests.')
else:
    print('The code has to be reviewed before it can be merged to main!')
    for run in runs:
        print(run)

I can do a pull request on that matter; as I've already tested a lot with WebDataset, I could reuse that code. :)

robvanvolt avatar Jun 14 '21 05:06 robvanvolt

Yeah, it would be fantastic to have some actual tests as well!

Where would it run, though? We can't automate Colab, and GitHub Actions doesn't provide GPU instances.

afiaka87 avatar Jun 14 '21 20:06 afiaka87

At first, before CI really gets implemented in a more elaborate way, someone could just copy and paste the output of test.py into the pull request to show the metrics / passed tests. It does not require too much effort and would sort out some obvious bugs, I think. :)

robvanvolt avatar Jun 15 '21 05:06 robvanvolt

Unless someone wants to rent a GPU for CI (and I'm not sure that's a good use of resources), I think the reasonable thing to do is to use GitHub Actions with DALLE-pytorch running on CPU. It won't be super fast, but it doesn't really need to be. At least worth a try, imo.
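
As a rough idea of what a CPU-only smoke test could exercise, here is a sketch based on the usage example in the DALLE-pytorch README (tiny arbitrary sizes so a few steps finish quickly; this is not the repo's actual training script):

import torch
from dalle_pytorch import DiscreteVAE, DALLE

# Deliberately tiny model so a few CPU steps finish in a reasonable time.
vae = DiscreteVAE(
    image_size=64,
    num_layers=2,
    num_tokens=512,
    codebook_dim=128,
    hidden_dim=32,
)

dalle = DALLE(
    dim=128,
    vae=vae,
    num_text_tokens=1000,
    text_seq_len=32,
    depth=2,
    heads=4,
    dim_head=32,
)

opt = torch.optim.Adam(dalle.parameters(), lr=3e-4)

# Three steps of batch size 2 on random data, as suggested above.
for step in range(3):
    text = torch.randint(0, 1000, (2, 32))
    images = torch.randn(2, 3, 64, 64)
    loss = dalle(text, images, return_loss=True)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, loss.item())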

rom1504 avatar Jun 15 '21 11:06 rom1504

Actions do not provide GPU instances, as afiaka87 said, and thus the most important tests couldn't be run, could they? So I thought that at least the contributors, who almost all have a GPU at home, could run a few tests taking no longer than a few minutes before finally confirming the pull request works. :)

robvanvolt avatar Jun 15 '21 12:06 robvanvolt

Why can't the most important tests be run on CPU? Even if it takes 5 min to do 3 steps of batch size 2 on CPU, it's good enough.

rom1504 avatar Jun 15 '21 19:06 rom1504

Is that so? I was not able to get Apex and DeepSpeed running CPU-only, without CUDA. :o But I might give it a second try if you say it should work.

robvanvolt avatar Jun 15 '21 20:06 robvanvolt

Not that it can't - just that my understanding is that it's not going to simply work on CPU with zero modifications.

afiaka87 avatar Jun 15 '21 20:06 afiaka87

Ah yes, for sure, probably some tweaking would be needed to support CPU. Not sure how much, indeed.

rom1504 avatar Jun 15 '21 20:06 rom1504