9.4.1 simplex errors with sup and hac models on M2

Open dbernick opened this issue 2 years ago • 8 comments

Using the 9.4.1 fast v3.4 model on a M2 mac runs without errors at 1.12e+07 samples/sec

Using 9.4.1 hac v3.3 produces multiple:
Metal command buffer list failed: 5,
at 3.479e+05 samples/sec

Using 9.4.1 sup v9.4.1 v3.6 produces pages of errors:
with an ultimate rate of: 8.575e+04 samples/sec

sup errors:

a few of these:

[2023-07-26 12:57:48.280] [warning] Metal command buffer lstm failed: 5, try #0
[2023-07-26 12:57:48.280] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-07-26 12:5

hundreds of these:

[2023-07-28 05:04:26.476] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-07-28 05:04:42.360] [warning] Metal command buffer linear/scan/softmax failed: 5, try #2
[2023-07-28 05:04:42.360] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-07-28 05:04:58.183] [warning] Metal command buffer linear/scan/softmax failed: 5, try #3
[2023-07-28 05:04:58.183] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-07-28 05:05:14.227] [warning] Metal command buffer linear/scan/softmax failed: 5, try #4
[2023-07-28 05:05:14.227] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))

also, the eventual count of sequences is off:

[2023-07-28 05:05:14.390] [info] > Reads basecalled: 29600000s]
[2023-07-28 05:05:14.391] [info] > Basecalled @ Samples/s: 8.575702e+04
[2023-07-28 05:05:19.144] [info] > Finished

The fast and hac models both report 296000 for "Reads basecalled".

dbernick avatar Jul 31 '23 18:07 dbernick

I have the same problem on an M2 Ultra Mac Studio.

[2023-12-06 16:41:26.246] [info] > Creating basecall pipeline
[2023-12-06 16:42:59.390] [info] - set batch size to 3504
[2023-12-06 16:44:14.950] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-06 16:44:14.950] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-06 16:44:53.973] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 1)
[2023-12-06 16:44:53.973] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-06 16:45:31.890] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 2)
[2023-12-06 16:45:31.890] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-06 16:46:10.444] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 3)
[2023-12-06 16:46:10.444] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-06 16:46:48.556] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 4)
[2023-12-06 16:46:48.556] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-06 16:46:48.577] [critical] Failed to successfully submit GPU command buffers.

j-jamshidi avatar Dec 06 '23 06:12 j-jamshidi

I have the same problem with M2 Pro and dorado 0.5.0

dorado basecaller ~/dorado_model/[email protected] ./pod5/ -x metal > basecall.bam
[2023-12-07 13:38:41.346] [info] > Creating basecall pipeline
[2023-12-07 13:38:57.083] [info] - set batch size to 432
[2023-12-07 13:39:29.928] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:39:29.929] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:39:46.020] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 1)
[2023-12-07 13:39:46.020] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:40:03.320] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 2)
[2023-12-07 13:40:03.320] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:40:35.806] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:40:35.806] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:41:09.567] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:41:09.567] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:43:04.695] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:43:04.695] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:44:11.826] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:44:11.826] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:44:28.159] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 1)
[2023-12-07 13:44:28.159] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:45:01.003] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:45:01.003] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:48:01.715] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:48:01.715] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:48:50.473] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:48:50.473] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:50:11.917] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:50:11.917] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:53:12.924] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:53:12.924] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:56:32.505] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2023-12-07 13:56:32.505] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:56:48.497] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 1)
[2023-12-07 13:56:48.497] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:57:04.407] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 2)
[2023-12-07 13:57:04.407] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:57:20.447] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 3)
[2023-12-07 13:57:20.447] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:57:36.416] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 4)
[2023-12-07 13:57:36.416] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2023-12-07 13:57:36.446] [critical] Failed to successfully submit GPU command buffers.
libc++abi: terminating due to uncaught exception of type std::runtime_error: Failed to successfully submit GPU command buffers.
Abort trap: 6

dorado was able to basecall reads despite the warnings, but then it stopped.

chilampoon avatar Dec 07 '23 19:12 chilampoon
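
In the logs posted so far, runs continue when a failed command buffer recovers within a few retries, and the critical error follows a failed "try 4". To compare logs across machines and models, a minimal Python sketch like the one below can summarize them. It assumes only the two warning formats that appear in this thread ("failed: 5, try #N" and "failed: status 5 (try N)"); the `summarize` function name is hypothetical and not part of dorado.

```python
import re
from collections import Counter

# Matches both warning formats seen in this thread:
#   "Metal command buffer lstm failed: 5, try #0"
#   "Metal command buffer linear/scan/softmax failed: status 5 (try 0)"
FAILURE_RE = re.compile(
    r"Metal command buffer (?P<stage>\S+) failed: (?:status )?(?P<status>\d+)"
    r"(?:, try #| \(try )(?P<attempt>\d+)"
)

# Message logged at [critical] level right before the process aborts.
CRITICAL_MSG = "[critical] Failed to successfully submit GPU command buffers"


def summarize(log_text):
    """Count failed command buffers per stage and report the highest retry index."""
    failures = Counter()
    max_attempt = -1  # stays -1 when the log contains no matching warnings
    for match in FAILURE_RE.finditer(log_text):
        failures[match.group("stage")] += 1
        max_attempt = max(max_attempt, int(match.group("attempt")))
    return {
        "failures_by_stage": dict(failures),
        "max_attempt": max_attempt,
        "aborted": CRITICAL_MSG in log_text,
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < dorado_stderr.log
    print(summarize(sys.stdin.read()))
```

Since dorado writes the BAM to stdout, the log can be captured separately by redirecting stderr to a file (for example `2> dorado_stderr.log`).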

Hi @dbernick @j-jamshidi @chilampoon,

We've been improving the stability of basecalling on Mac in recent releases.

Are you still experiencing issues?

Kind regards, Rich

HalfPhoton avatar Feb 06 '24 17:02 HalfPhoton

Hey @HalfPhoton, I got the same issue with a fresh install of dorado v0.5.3 on an M2 Pro MacBook.

Dorado command: dorado basecaller sup --kit-name SQK-16S024 --min-qscore 7 21_04_20_zfish.pod5 > 21_04_20_zfish.bam

Error code:

[2024-02-09 15:29:43.071] [info] Assuming cert location is /etc/ssl/cert.pem
[2024-02-09 15:29:43.072] [info]  - downloading [email protected] with httplib
[2024-02-09 15:29:47.393] [info] > Creating basecall pipeline
[2024-02-09 15:30:00.845] [info]  - set batch size to 432
[2024-02-09 15:30:00.845] [info] Barcode for SQK-16S024
[2024-02-09 15:31:06.899] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2024-02-09 15:31:06.899] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:31:55.668] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2024-02-09 15:31:55.668] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:32:27.847] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 0)
[2024-02-09 15:32:27.847] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:32:43.931] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 1)
[2024-02-09 15:32:43.931] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:32:59.966] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 2)
[2024-02-09 15:32:59.966] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:33:16.260] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 3)
[2024-02-09 15:33:16.260] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:33:32.559] [warning] Metal command buffer linear/scan/softmax failed: status 5 (try 4)
[2024-02-09 15:33:32.559] [warning] Command buffer error code: 1 (Internal Error (0000000e:Internal Error))
[2024-02-09 15:33:32.584] [critical] Failed to successfully submit GPU command buffers.
libc++abi: terminating due to uncaught exception of type std::runtime_error: Failed to successfully submit GPU command buffers.
Abort trap: 6

microbemarsh avatar Feb 09 '24 21:02 microbemarsh

Hi. Can you tell me how much RAM the system has? The selected batch size does not seem unreasonable, but it would be good to rule out memory swapping as a factor in this case. Does this failure happen reliably?

StuartAbercrombie avatar Feb 12 '24 09:02 StuartAbercrombie

Only 16 GB of RAM. I can set the batch / chunk size if you think that may be the issue.

microbemarsh avatar Feb 12 '24 12:02 microbemarsh

The failure is very repeatable and does not happen with the version 10 model. In my case, the system has 32 GB.

David

dbernick avatar Feb 12 '24 16:02 dbernick

Hi @microbemarsh,

Does reducing the --batchsize improve things?

HalfPhoton avatar Sep 17 '24 10:09 HalfPhoton
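
One way to test the `--batchsize` suggestion systematically is to start from the automatically selected value (432 in the M2 Pro logs above) and halve it until a run finishes cleanly. The Python sketch below is only an illustration under stated assumptions: the helper names, the floor of 16, and the halving strategy are all hypothetical, and nothing in this thread confirms that a smaller batch size resolves the Metal errors. The only dorado options used are `basecaller` and `--batchsize`, both of which appear in the thread.

```python
import subprocess


def batch_sizes(start, floor=16):
    """Yield start, start // 2, ... down to (but not below) floor."""
    size = start
    while size >= floor:
        yield size
        size //= 2


def dorado_command(model, pod5_path, batch_size):
    """Build the basecaller command line with an explicit batch size."""
    return ["dorado", "basecaller", model, pod5_path, "--batchsize", str(batch_size)]


def run_dorado(model, pod5_path, out_path, batch_size):
    """Run dorado once, writing the BAM to out_path; True on a zero exit code."""
    with open(out_path, "wb") as out:
        result = subprocess.run(dorado_command(model, pod5_path, batch_size), stdout=out)
    return result.returncode == 0


def first_working_batch_size(sizes, attempt):
    """Return the first size for which attempt(size) is truthy, else None."""
    for size in sizes:
        if attempt(size):
            return size
    return None


if __name__ == "__main__":
    # Hypothetical paths; substitute your own model and input file.
    working = first_working_batch_size(
        batch_sizes(432),
        lambda size: run_dorado("sup", "reads.pod5", "reads.bam", size),
    )
    print("first batch size that completed:", working)
```

Each attempt overwrites the output file, so only the result of the last run is kept. A crashing run exits with a nonzero status (the "Abort trap: 6" above), which is what `run_dorado` treats as failure.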