
Multi A100 expected CPU/GPU utilization in correction inference stage

ChristianKniep opened this issue 10 months ago · 5 comments

I am trying to utilize a p4.24xl (8×A100 and plenty of CPUs), but when running this step I see barely any CPU/GPU utilization. A couple of GB of memory are allocated on the GPU and the CPU, but not much else is going on.

I am using version 0.9.0+9dc15a8 from the upstream Docker Hub container.

$ dorado correct /shared/data.fq \
  --from-paf /shared/data_overlaps.paf \
  > /shared/data_corrected_reads.fasta

Output

[2024-12-19 10:17:19.679] [info] Running: "correct" "/shared/data.fq" "--from-paf" "/shared/data_overlaps.paf"
[2024-12-19 10:17:19.736] [info]  - downloading herro-v1 with httplib
[2024-12-19 10:17:22.201] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2024-12-19 10:17:22.201] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2024-12-19 10:17:22.374] [info] Using batch size 12 on device cuda:1 in inference thread 0.
[2024-12-19 10:17:22.374] [info] Using batch size 12 on device cuda:1 in inference thread 1.
[2024-12-19 10:17:22.581] [info] Using batch size 12 on device cuda:2 in inference thread 0.
[2024-12-19 10:17:22.587] [info] Using batch size 12 on device cuda:2 in inference thread 1.
[2024-12-19 10:17:22.824] [info] Using batch size 12 on device cuda:3 in inference thread 0.
[2024-12-19 10:17:22.824] [info] Using batch size 12 on device cuda:3 in inference thread 1.
[2024-12-19 10:17:23.007] [info] Using batch size 12 on device cuda:4 in inference thread 0.
[2024-12-19 10:17:23.007] [info] Using batch size 12 on device cuda:4 in inference thread 1.
[2024-12-19 10:17:23.190] [info] Using batch size 12 on device cuda:5 in inference thread 0.
[2024-12-19 10:17:23.190] [info] Using batch size 12 on device cuda:5 in inference thread 1.
[2024-12-19 10:17:23.379] [info] Using batch size 12 on device cuda:6 in inference thread 0.
[2024-12-19 10:17:23.380] [info] Using batch size 12 on device cuda:6 in inference thread 1.
[2024-12-19 10:17:23.569] [info] Using batch size 12 on device cuda:7 in inference thread 0.
[2024-12-19 10:17:23.570] [info] Using batch size 12 on device cuda:7 in inference thread 1.

How can I improve the utilization of the box?
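For anyone hitting the same symptom, here is a sketch of things worth trying. The flag names `--infer-threads` and `--batch-size` are assumed from `dorado correct --help`; verify they exist in your build before relying on them:

```shell
# Watch per-GPU utilization while the job runs
# (sm = compute utilization, mem = memory-bandwidth utilization).
nvidia-smi dmon -s um -d 5

# Try raising the per-thread batch size and the number of inference
# threads per GPU; the logged defaults (batch size 12, 2 threads per
# device) may leave an A100 mostly idle. Flag names are assumptions,
# not confirmed for 0.9.0 -- check `dorado correct --help`.
dorado correct /shared/data.fq \
  --from-paf /shared/data_overlaps.paf \
  --infer-threads 4 \
  --batch-size 64 \
  -v \
  > /shared/data_corrected_reads.fasta
```

Note that with `--from-paf` the all-vs-all mapping stage is skipped, so low CPU usage can also mean the run is bottlenecked on reading the PAF and FASTQ from `/shared` rather than on compute.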

ChristianKniep avatar Dec 19 '24 11:12 ChristianKniep