How to improve the reproducibility of dorado across different GPUs?
Issue Report
Please describe the issue:
We used to run dorado on 4xA100 80GB, but my boss asked me to explore running it on 8xT4 16GB. I did that. The 8xT4 setup runs about 5x slower, as expected, but seemingly produces quite different variant calling results.
Specifically, the SNP variant calling results differ by ~0.1% in the hg38 GIAB high-confidence region and by ~10% in the low-confidence region according to rtg tools. I compared the unmapped BAMs and found that, among the 20,544,209 reads called by 4xA100, only 3,379,509 (16.45%) are fully identical. 13,117 reads are called by 4xA100 but not by 8xT4, and 7,490 reads are called by 8xT4 but not by 4xA100.
My understanding is that, because these two GPUs have different VRAM sizes, the batch sizes picked are different, and so the AI inference results differ. Does that mean that if I force them to run at the same batch size, I can expect nearly identical basecalls? Are there other parameters I need to tune to make the results more reproducible? Thanks a lot in advance.
Presumably, the 8xT4 picks a smaller batch size that is likely to produce slightly suboptimal results. That probably explains why more reads are called by the A100. Is that right?
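A minimal sketch of this kind of comparison (assuming pysam is installed; `a100.bam` and `t4.bam` are placeholder names for the two unmapped BAMs, and read names are assumed unique within each file):

```python
# Sketch only: compare basecalls by read name between two unmapped BAMs.
# Holding ~20M sequences in memory is heavy; hashing each sequence instead
# would reduce the footprint.
import pysam

def load_seqs(path):
    seqs = {}
    with pysam.AlignmentFile(path, "rb", check_sq=False) as bam:
        for read in bam:
            seqs[read.query_name] = read.query_sequence
    return seqs

a100 = load_seqs("a100.bam")  # placeholder path
t4 = load_seqs("t4.bam")      # placeholder path

shared = a100.keys() & t4.keys()
identical = sum(1 for name in shared if a100[name] == t4[name])

print(f"reads only in the A100 output: {len(a100.keys() - t4.keys())}")
print(f"reads only in the T4 output:   {len(t4.keys() - a100.keys())}")
print(f"shared reads with byte-identical sequences: {identical} / {len(shared)}")
```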
Run environment:
- Dorado version: 0.4.1
- Dorado command: dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_5mCG_5hmCG@v2
- Operating system: Ubuntu 22.04
- Hardware (CPUs, Memory, GPUs): 4xA100 and 8xT4
Hi @ymcki,
Does that mean that if I force them to run at the same batch size, I can expect nearly identical basecalls?
Setting the same batch size will not result in nearly identical basecalls because the GPU architectures are different. The two architectures (Turing / Ampere) take different paths through the LSTM kernels, as we are able to make some optimisations depending on the architecture. These subtle differences can also affect read splitting, which may be why some reads are called by one GPU but not the other.
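As a rough illustration of the kind of effect involved (a plain numpy sketch, not dorado's kernels): the same floating-point reduction computed along two different paths, which is effectively what different kernel implementations or tile/batch sizes do, usually differs in its last bits, and an argmax-style decoder downstream can occasionally flip on such differences.

```python
# Illustration only: reduction order changes float32 results slightly.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(1_000_000).astype(np.float32)

total_flat = x.sum()                                    # one reduction over the flat array
total_tiled = x.reshape(1000, 1000).sum(axis=1).sum()   # per-tile partial sums, then combined

print(total_flat, total_tiled)    # typically differ in the final digits
print(total_flat == total_tiled)  # often False
```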
Kind regards, Rich
Thank you very much for your reply. So I think it is futile to try to improve the reproducibility.
Is the A100 more likely to call better reads than the T4, since your code is optimized for the A100?
@HalfPhoton Rich,
Could documentation be added on this, or a more in-depth explanation of the subtle differences in basecalling seen between different architectures? In core lab facilities and clinical settings it is important to know what consequences integrating different architectures has for our workflows, as you can understand. Even for CAP/CLIA it is important to have documentation showing these differences so we can report the margin of error in calling.
If there is anything that we can do to help, please let us know.
We will definitely be updating the README here to document this phenomenon.
@ymcki - can you explain in more detail what you mean by this:
I compared the unmapped BAMs and found that, among the 20,544,209 reads called by 4xA100, only 3,379,509 (16.45%) are fully identical. 13,117 reads are called by 4xA100 but not by 8xT4, and 7,490 reads are called by 8xT4 but not by 4xA100.
I'm a little bit confused: when you say "3,379,509 are fully identical", what do you mean?
As a follow-up to this, is the same issue seen here expected with the A10 or A30 GPUs? I imagine the answer is no, since they utilize the same architecture as the A100, but I would like confirmation.
Thank you!
I mean that, for reads with the same name, I count them as fully identical if the two called sequences have exactly the same length and match perfectly.
I probably should use some sort of dynamic programming alignment to generate alignment scores for a better characterisation of the differences, but that's too much work for a preliminary analysis.
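If it were worth doing, a lightweight option (a sketch assuming the edlib Python package; `seq_a100` and `seq_t4` are placeholder variables for the two basecalls of one read) would be to score each read pair by edit distance instead of requiring exact identity:

```python
# Sketch: normalised edit distance between the two basecalls of the same read.
import edlib

def normalised_edit_distance(seq_a100: str, seq_t4: str) -> float:
    """Edit distance divided by the longer sequence length (0.0 = identical)."""
    result = edlib.align(seq_a100, seq_t4, task="distance")
    return result["editDistance"] / max(len(seq_a100), len(seq_t4))

print(normalised_edit_distance("ACGTACGT", "ACGAACGT"))  # 0.125 for this toy pair
```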
As for the A10 and A30: the A30 is the same architecture as the A100, but the A10 is not.
@ymcki Where did you find this? I am unfamiliar with GPUs, so I went to Nvidia's datasheet for these GPUs, which suggests all three run on the Ampere architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a10/pdf/datasheet-new/nvidia-a10-datasheet.pdf
Hi @ymcki,
Thank you very much for raising this. We are investigating this issue and would really appreciate more insight into your results. Would you mind getting in touch with [email protected] referencing this Github issue and mentioning my name (Susie Lee) so we can discuss it in more detail, please?
Many thanks, Susie
I found it in https://github.com/nanoporetech/dorado/issues/459.
Someone there said that dorado is optimized for a 164KB shared memory architecture. The A10 only has 100KB of shared memory, so it will run in FP16 mode instead of the faster INT8 mode. Since INT8 is a lower-precision approximation of FP16, you can't expect them to produce the same results.
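As a rough illustration of why an INT8 path and an FP16 path cannot be expected to agree bit-for-bit with each other or with a full-precision reference (a numpy sketch, not dorado's actual kernels):

```python
# Compare a float32 matrix-vector product against float16 and a crude
# symmetric int8-quantised version of the same computation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

y_fp32 = W @ x
y_fp16 = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

scale_w = np.abs(W).max() / 127.0
scale_x = np.abs(x).max() / 127.0
W_q = np.round(W / scale_w).astype(np.int8)
x_q = np.round(x / scale_x).astype(np.int8)
y_int8 = (W_q.astype(np.int32) @ x_q.astype(np.int32)) * scale_w * scale_x

print("fp16 vs fp32, max abs diff:", np.abs(y_fp16 - y_fp32).max())
print("int8 vs fp32, max abs diff:", np.abs(y_int8 - y_fp32).max())
```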
I tried re-running basecalling on the 4xA100 and found that 100% of the basecalled reads are fully identical. That means basecalling is a deterministic process that will reproduce the result exactly when the hardware is the same.
What kind of GPU differences can cause this irreproducibility? I have these GPU nodes in the cluster I use:
- 2x NVIDIA Tesla V100 16GB
- 3x NVIDIA V100 16GB
- 4x NVIDIA A100 80GB, divided into 2 40GB MIG instances (16 total)
My basecalling jobs are spread across these nodes. Should I worry about reproducibility?
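In case it helps to frame the question: the V100 nodes and the A100 MIG instances report different compute capabilities, which is one of the differences that can lead to different kernel choices. A quick way to see what each node reports (a sketch assuming PyTorch with CUDA is available; MIG slices appear as ordinary devices):

```python
# Sketch: report what each visible CUDA device says about itself.
# V100 reports compute capability 7.0 (Volta); A100 reports 8.0 (Ampere).
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, "
          f"compute capability {props.major}.{props.minor}, "
          f"{props.total_memory / 1e9:.0f} GB")
```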