
How to improve the reproducibility of dorado across different GPUs?

Open ymcki opened this issue 1 year ago • 12 comments

Issue Report

Please describe the issue:

We used to run dorado on 4xA100 80GB, but my boss asked me to explore running it on 8xT4 16GB. I did that. The 8xT4 setup runs about 5x slower, as expected, but seemingly produces quite different variant calling results.

Specifically, the SNP variant calling results differ by ~0.1% in the hg38 GIAB high-confidence region and by ~10% in the low-confidence region according to rtg tools. I compared the unmapped BAMs and found that, of the 20,544,209 reads called by the 4xA100, only 3,379,509 (16.45%) are fully identical. 13,117 reads are called by the 4xA100 but not by the 8xT4, and 7,490 reads are called by the 8xT4 but not by the 4xA100.

My understanding is that, due to the different VRAM sizes of these two GPUs, the batch sizes picked are different, such that the AI inference results differ. Does that mean that if I force them to run at the same batch size, I can expect nearly identical basecalls? Are there other parameters I need to tune to make the results more reproducible? Thanks a lot in advance.

Presumably, the 8xT4 picks a smaller batch size that is likely to produce a slightly suboptimal result. That probably explains why more reads are called by the A100. Is that right?

Run environment:

  • Dorado version: 0.4.1
  • Dorado command: dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_5mCG_5hmCG@v2
  • Operating system: Ubuntu 22.04
  • Hardware (CPUs, Memory, GPUs): 4xA100 and 8xT4

ymcki avatar Feb 07 '24 01:02 ymcki

Hi @ymcki,

> Does that mean that if I force them to run at the same batch size, I can expect nearly identical basecalls?

Setting the same batch size will not result in nearly identical basecalls because the GPU architectures are different. The two architectures (Turing / Ampere) take different paths through the LSTM kernels, as we are able to make some optimisations depending on the architecture. These subtle differences can also affect read splitting, which may be why some reads are called by one GPU but not the other.
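As a rough illustration, here is the architecture mapping involved (the compute capabilities below are NVIDIA's published figures; the grouping logic is only illustrative, not dorado's actual dispatch code):

```python
# Map CUDA compute capability (major, minor) to architecture family.
# dorado selects different LSTM kernel paths per architecture, so devices
# in different families can produce subtly different basecalls.
ARCH_BY_CAPABILITY = {
    (7, 0): "Volta",   # V100
    (7, 5): "Turing",  # T4
    (8, 0): "Ampere",  # A100 / A30
    (8, 6): "Ampere",  # A10 (same family, but smaller shared memory)
}

def same_kernel_family(cap_a, cap_b):
    """True if two devices fall in the same architecture family.

    Note: the same family does not guarantee identical results either,
    since shared-memory size can still select a different kernel mode.
    """
    return ARCH_BY_CAPABILITY.get(cap_a) == ARCH_BY_CAPABILITY.get(cap_b)

print(same_kernel_family((7, 5), (8, 0)))  # T4 vs A100 -> False
print(same_kernel_family((8, 0), (8, 0)))  # A100 vs A100 -> True
```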

Kind regards, Rich

HalfPhoton avatar Feb 07 '24 11:02 HalfPhoton

> Hi @ymcki,
>
> > Does that mean that if I force them to run at the same batch size, I can expect nearly identical basecalls?
>
> Setting the same batch size will not result in nearly identical basecalls because the GPU architectures are different. The two architectures (Turing / Ampere) take different paths through the LSTM kernels, as we are able to make some optimisations depending on the architecture. These subtle differences can also affect read splitting, which may be why some reads are called by one GPU but not the other.
>
> Kind regards, Rich

Thank you very much for your reply. So I think it is futile to try to improve the reproducibility.

Is the A100 more likely to call better reads than the T4, given that your code is optimized for the A100?

ymcki avatar Feb 08 '24 01:02 ymcki

@HalfPhoton Rich,

Could documentation be written on this, or a more in-depth explanation of the subtle basecalling differences seen between different architectures? In core lab facilities and clinical settings it is important to understand the consequences of integrating different architectures into our workflows, as you can understand. Even for CAP/CLIA it is important to have documentation showing these differences, so we can demonstrate the margin of error in calling.

If there is anything that we can do to help, please let us know.

ethan-mcq avatar Feb 08 '24 04:02 ethan-mcq

We will definitely be updating the README here to document this phenomenon.

@ymcki - can you explain in more detail what you mean by this:

> I compared the unmapped BAMs and found that, of the 20,544,209 reads called by the 4xA100, only 3,379,509 (16.45%) are fully identical. 13,117 reads are called by the 4xA100 but not by the 8xT4, and 7,490 reads are called by the 8xT4 but not by the 4xA100.

I'm a little bit confused - when you say "3,379,509 are fully identical", what do you mean?

vellamike avatar Feb 08 '24 09:02 vellamike

As a follow-up to this, is the same issue expected with the A10 or A30 GPUs? I imagine the answer is no, since they utilize the same architecture as the A100, but I would like confirmation.

Thank you!

minefield47 avatar Feb 08 '24 22:02 minefield47

> We will definitely be updating the README here to document this phenomenon.
>
> @ymcki - can you explain in more detail what you mean by this:
>
> > I compared the unmapped BAMs and found that, of the 20,544,209 reads called by the 4xA100, only 3,379,509 (16.45%) are fully identical. 13,117 reads are called by the 4xA100 but not by the 8xT4, and 7,490 reads are called by the 8xT4 but not by the 4xA100.
>
> I'm a little bit confused - when you say "3,379,509 are fully identical", what do you mean?

I am saying that, for reads with the same name, they are fully identical if they have exactly the same length and match perfectly.

I probably should use some sort of dynamic-programming alignment to generate alignment scores for a better characterization of the differences, but that's too much work for a preliminary analysis.
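A minimal sketch of this comparison (toy stand-in data; in practice the name-to-sequence maps would be parsed from the two unmapped BAMs, e.g. via `samtools fasta`):

```python
# Compare two runs' reads keyed by read name; a read is "fully identical"
# iff the sequences match exactly. run_a / run_b stand in for sequences
# extracted from the two unmapped BAMs; the toy data is illustrative only.
def compare_runs(run_a, run_b):
    shared = run_a.keys() & run_b.keys()
    identical = sum(1 for name in shared if run_a[name] == run_b[name])
    return {
        "identical": identical,
        "differing": len(shared) - identical,
        "only_a": len(run_a.keys() - run_b.keys()),
        "only_b": len(run_b.keys() - run_a.keys()),
    }

run_a = {"read1": "ACGT", "read2": "ACGTA", "read3": "TTGA"}
run_b = {"read1": "ACGT", "read2": "ACGTT"}
print(compare_runs(run_a, run_b))
# {'identical': 1, 'differing': 1, 'only_a': 1, 'only_b': 0}
```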

ymcki avatar Feb 09 '24 03:02 ymcki

> As a follow-up to this, is the same issue expected with the A10 or A30 GPUs? I imagine the answer is no, since they utilize the same architecture as the A100, but I would like confirmation.
>
> Thank you!

The A30 is the same architecture as the A100, but the A10 is not.

ymcki avatar Feb 09 '24 03:02 ymcki

@ymcki Where did you find this? I am unfamiliar with GPUs so went to Nvidia's datasheet for the GPUs...which suggests all three run on the Ampere architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a10/pdf/datasheet-new/nvidia-a10-datasheet.pdf

minefield47 avatar Feb 09 '24 03:02 minefield47

Hi @ymcki,

Thank you very much for raising this. We are investigating this issue and would really appreciate more insight into your results. Would you mind getting in touch with [email protected] referencing this Github issue and mentioning my name (Susie Lee) so we can discuss it in more detail, please?

Many thanks, Susie

susie-ont avatar Feb 12 '24 14:02 susie-ont

> @ymcki Where did you find this? I am unfamiliar with GPUs so went to Nvidia's datasheet for the GPUs... which suggests all three run on the Ampere architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a10/pdf/datasheet-new/nvidia-a10-datasheet.pdf

https://github.com/nanoporetech/dorado/issues/459

Previously, someone there said that dorado is optimized for a 164KB shared-memory architecture. The A10 only has 100KB of shared memory, so it will run in FP16 mode instead of the faster INT8 mode. Well, since INT8 is an approximation/emulation of FP16, you can't expect the two modes to produce the same results.
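A toy illustration of why the two modes can diverge (generic INT8 quantisation arithmetic, not dorado's actual kernels): rounding values to an 8-bit grid introduces small errors that can flip near-tie decisions downstream.

```python
# Symmetric INT8 quantisation round-trip: map a float to the nearest
# representable int8 step, then back. Two nearby inputs can land on
# different steps, so an INT8 path cannot exactly match an FP16 path.
def quantize_int8(x, scale):
    q = max(-128, min(127, round(x / scale)))  # clamp to the int8 range
    return q * scale  # dequantised value

scale = 0.05
for x in (0.123, 0.126):
    print(x, "->", quantize_int8(x, scale))
```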

ymcki avatar Feb 14 '24 01:02 ymcki

I tried re-running basecalling on the 4xA100 and found that 100% of the reads basecalled are fully identical. That means basecalling is a deterministic process that reproduces the result exactly when the hardware is the same.
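In sketch form, such a determinism check amounts to hashing each run's calls in a canonical order (toy data below; the real sequences would come from the unmapped BAMs):

```python
import hashlib

# Hash each run's reads in name-sorted order; identical digests imply
# byte-identical basecalls regardless of the order reads were emitted.
def run_digest(reads):  # reads: dict of read name -> sequence
    h = hashlib.sha256()
    for name in sorted(reads):
        h.update(f"{name}\t{reads[name]}\n".encode())
    return h.hexdigest()

first = {"read1": "ACGT", "read2": "TTGA"}
rerun = {"read2": "TTGA", "read1": "ACGT"}  # same calls, different order
print(run_digest(first) == run_digest(rerun))  # True for identical calls
```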

ymcki avatar Feb 27 '24 04:02 ymcki

What kinds of GPU differences can cause this irreproducibility? I have these GPU nodes in the cluster I use:

  • 2x NVIDIA Tesla V100 16GB
  • 3x NVIDIA V100 16GB
  • 4x NVIDIA A100 80GB, divided into 2 40GB MIG instances (16 total)

My basecalling jobs are spread across these nodes. Should I worry about reproducibility?
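Per the earlier comments in this thread, reads basecalled on Volta (V100, compute capability 7.0) and Ampere (A100, 8.0) devices may differ slightly, and a MIG slice inherits its parent A100's architecture. One hedged way to keep each sample comparable is to group nodes by architecture family and schedule a given sample within one group (node names below are hypothetical):

```python
# Group cluster nodes by GPU architecture family so each sample is
# basecalled entirely within one family. The architecture facts are
# NVIDIA's; the node layout is made up for illustration.
ARCH = {"V100": "Volta", "A100": "Ampere", "A100-MIG": "Ampere"}

def groups_by_arch(node_gpus):
    """node_gpus: dict of node name -> GPU model."""
    out = {}
    for node, gpu in node_gpus.items():
        out.setdefault(ARCH[gpu], []).append(node)
    return out

cluster = {"gpu01": "V100", "gpu02": "V100", "gpu03": "A100-MIG"}
print(groups_by_arch(cluster))
# {'Volta': ['gpu01', 'gpu02'], 'Ampere': ['gpu03']}
```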

weishwu avatar Sep 25 '24 16:09 weishwu