
[Issue]: Performance of 9070xt with ComfyUI

Open alshdavid opened this issue 10 months ago • 67 comments

Problem Description

I'm using the default ComfyUI SDXL workflow with Ubuntu 24.04 and the proprietary AMD drivers. This is also the case using Fedora 42.

Issues:

  • Using the default launch settings, the generation crashes with OOM errors
  • Using tweaked settings, generation passes but is very slow

Results:

  • Clean Ubuntu 24.04 installation
  • AMD proprietary drivers
  • Python 3.10.17
  • PyTorch nightly
  • ROCm 6.4.1

export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"

python ./main.py
Card    Model  Steps  Resolution  Time    Notes
6900xt  SD1.5  20     512x512     2.42s   ROCm 6.3
9070xt  SD1.5  20     512x512     3.76s
6900xt  SDXL   20     1024x1024   15.16s  ROCm 6.3
9070xt  SDXL   20     1024x1024   FAIL    Crashed with out of memory
9070xt  SDXL   20     1024x1024   30.51s  Used tiled VAE decoder to avoid OOM failure

Results 2:

  • Clean Ubuntu 24.04 installation
  • AMD proprietary drivers
  • Python 3.10.17
  • PyTorch nightly
  • ROCm 6.4.1

export PYTORCH_TUNABLEOP_ENABLED=1
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"

python ./main.py --use-pytorch-cross-attention 
Model  Steps  Resolution  Speed     Time    Notes
SDXL   20     1024x1024   1.49it/s  34.56s
SDXL   20     1024x1024   1.50it/s  27.76s  Manual tiled VAE decoder

Operating System

Ubuntu 24.04

CPU

AMD 7950x

GPU

Radeon RX 9070xt

ROCm Version

6.4.1

ROCm Component

No response

Steps to Reproduce

  • Install Ubuntu 24.04
  • Install AMD proprietary drivers with ROCm
  • Install Python 3.10.17
  • Clone ComfyUI
  • Start ComfyUI
  • Use default workflow for SDXL

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

alshdavid avatar May 30 '25 10:05 alshdavid

Hi @alshdavid. Internal ticket has been created to investigate this issue. Thanks!

ppanchad-amd avatar May 30 '25 13:05 ppanchad-amd

Can confirm this is happening to me as well, on a workflow that used to work on my 6700.

chutchatut avatar Jun 01 '25 03:06 chutchatut

Also related: https://github.com/comfyanonymous/ComfyUI/issues/7332

alshdavid avatar Jun 01 '25 22:06 alshdavid

Prepend MIOPEN_FIND_MODE=2 to your comfyui command.
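For anyone unsure what "prepend" means here: put the assignment on the same line, before the interpreter, so it applies only to that one process (MIOPEN_FIND_MODE=2 is MIOpen's fast find mode, which skips the exhaustive kernel search). A runnable illustration in generic shell, nothing ComfyUI-specific:

```shell
# Scoped to one process: the real invocation would be
#   MIOPEN_FIND_MODE=2 python main.py
# Demonstration that the child process sees the variable but the shell does not:
MIOPEN_FIND_MODE=2 sh -c 'echo "child sees: $MIOPEN_FIND_MODE"'
echo "parent sees: ${MIOPEN_FIND_MODE:-unset}"
```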

Matthew-Jenkins avatar Jun 05 '25 01:06 Matthew-Jenkins

@Matthew-Jenkins, appears to make no difference unfortunately:

With (baseline):

export PYTORCH_TUNABLEOP_ENABLED=1
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"

python3.10 ./main.py --use-pytorch-cross-attention 
Model Steps Resolution Speed Time Notes
SDXL 20 1024x1024 1.49it/s 34.56s
SDXL 20 1024x1024 1.5it/s 27.76s Manual tiled VAE decoder

With:

export MIOPEN_FIND_MODE=2
export PYTORCH_TUNABLEOP_ENABLED=1
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"

python3.10 ./main.py --use-pytorch-cross-attention 
Model Steps Resolution Speed Time Notes
SDXL 20 1024x1024 1.46it/s 34.74s
SDXL 20 1024x1024 1.48it/s 27.88s Manual tiled VAE decoder

I also tried to use the MIGraphX node, however it either doesn't work with SDXL or there is an issue with my configuration. I raised an issue on their repo: https://github.com/pnikolic-amd/ComfyUI_MIGraphX/issues/5

alshdavid avatar Jun 05 '25 02:06 alshdavid

9070 support wasn't added until ROCm 6.4.1, so you'll have problems until the PyTorch nightly updates to use it. The HSA override can make the card run as if it were a previous-generation part.

The next thing to try is HSA_OVERRIDE_GFX_VERSION. Try MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=11.0.0

If that doesn't work, then try

MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=10.3.0

Until PyTorch updates to 6.4.1 you will not get any benefit from the AI cores. But once it does, you can expect about 1.2 TFLOPs for int4 ops.
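A minimal launcher sketch for trying the two overrides (11.0.0 reports the GPU as gfx1100/RDNA3, 10.3.0 as gfx1030/RDNA2). The wrapper shape and echo are illustrative, not part of any tool; the real launch line is left commented out:

```shell
# Hypothetical wrapper: ./run_comfy.sh [gfx-version]
# Defaults to the RDNA3 spoof; pass 10.3.0 to try the RDNA2 fallback.
GFX="${1:-11.0.0}"
export MIOPEN_FIND_MODE=2
export HSA_OVERRIDE_GFX_VERSION="$GFX"
echo "would launch with HSA_OVERRIDE_GFX_VERSION=$HSA_OVERRIDE_GFX_VERSION"
# python main.py --use-pytorch-cross-attention   # real launch, commented out here
```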

Matthew-Jenkins avatar Jun 05 '25 14:06 Matthew-Jenkins

Thanks for the tips

Try MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=11.0.0 if that doesn't work then try MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=10.3.0

Just tried both of these, unfortunately no luck there.

Until pytorch updates to 6.4.1 you will not get any benefit from the ai cores.

Ah, well that's good news. I'll keep an eye on PyTorch's progress

alshdavid avatar Jun 06 '25 00:06 alshdavid

@alshdavid you need this for comfyui to work correctly https://github.com/comfyanonymous/ComfyUI/pull/8289

kasper93 avatar Jun 06 '25 17:06 kasper93

@alshdavid you need this for comfyui to work correctly comfyanonymous/ComfyUI#8289

I tried your commit with SDXL and VAE Decode (Tiled) still performs much worse than it used to on my old GPU (6700). Is that the same on your end?

Flag used: --use-pytorch-cross-attention

Edit: Never mind, I had to turn off the experimental feature. But somehow the output image is gibberish with weird circular patterns.

Edit 2: Using the newest torch from upstream instead of AMD's build fixed the weird circular patterns issue.

chutchatut avatar Jun 07 '25 10:06 chutchatut

For convolution workloads, Winograd solvers in MIOpen would help a lot; I asked about those here: https://github.com/ROCm/MIOpen/issues/3750. Another issue is that GEMM tuning for gfx1201 seems not to be optimal by default for some workloads. And there are probably a few other areas that could improve.

With the ComfyUI changes I mentioned, TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, and a PyTorch install from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/, it should work "reasonably", meaning there is still a lot of performance left on the table, but it's not completely broken anymore...

Using https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/how-to-use-hipblaslt-offline-tuning.html can increase perf 2x in some cases.
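On the tuning point: PyTorch's TunableOp (already in use in this thread via PYTORCH_TUNABLEOP_ENABLED) can record its GEMM tuning results to a file on a first run and replay them afterwards, so the tuning cost is paid once. A sketch of that record/replay flow, with env var names as documented for PyTorch's TunableOp (this is separate from hipBLASLt's own offline tuning linked above; launch lines are placeholders):

```shell
# First run: tune GEMMs and record the winners (slow, one-off)
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1
export PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv
# python main.py --use-pytorch-cross-attention

# Later runs: replay the recorded tunings without re-tuning
export PYTORCH_TUNABLEOP_TUNING=0
# python main.py --use-pytorch-cross-attention
echo "TunableOp results file: $PYTORCH_TUNABLEOP_FILENAME"
```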

kasper93 avatar Jun 07 '25 13:06 kasper93

I'm having the same problem using pytorch-rocm with chaiNNer. With my 6800XT, an inference took 17 seconds, while the same takes 28 seconds with the 9070XT.

Loacoon1 avatar Jun 08 '25 17:06 Loacoon1

Hey @kasper93, regarding this:

and pytorch install from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/ it should work "reasonable"

Are there docs on how to install PyTorch from here?

I have no idea what pip install is doing or how Python resolves its dependencies so I don't know how to pick the archives and manually reproduce the pip installation. I've been using the pip3 install command from pytorch.org, which I assume installs the packages into some global packages folder in the venv folder. Do you just download & extract the archives there?

I tried the following but that doesn't work:

pip3 install https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/pytorch_triton_rocm-3.2.0%2Brocm6.4.1.git6da9e660-cp39-cp39-linux_x86_64.whl

Also, how do I know that ComfyUI is using those packages rather than the previously installed ones from pytorch.org?

Apparently, PyTorch just added support for ROCm 6.4.1 but either I am installing it wrong or it doesn't appear to make any difference.
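On the "where do packages go" question: pip unpacks wheels into the active interpreter's site-packages directory, and the last install of a given package name wins, so installing AMD's wheel into the same venv replaces the pytorch.org one. A small stdlib-only sketch to locate that directory, with the usual torch sanity check left as a comment (the torch attributes mentioned are standard, but run it yourself inside your venv to confirm):

```python
import sysconfig

# pip installs packages into the interpreter's "purelib" directory; inside a
# venv this is <venv>/lib/pythonX.Y/site-packages.
site_packages = sysconfig.get_paths()["purelib"]
print("packages install into:", site_packages)

# To confirm which torch build ComfyUI actually imports, run in the same venv:
#   python -c "import torch; print(torch.__version__, torch.version.hip, torch.__file__)"
# torch.__file__ should point into the directory above, and the local version
# suffix (e.g. +rocm6.4) indicates which index the wheel came from.
```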

alshdavid avatar Jun 28 '25 02:06 alshdavid

@alshdavid: Keep expectations low, it's still slow in general. It works, but it seems you have overenthusiastic expectations for it.

See my comment https://github.com/ROCm/ROCm/issues/4846#issuecomment-2952494361

Also, if your workload is only SD, AMD recently updated the MIGraphX node for ComfyUI: https://github.com/pnikolic-amd/ComfyUI_MIGraphX. After compiling the model with it, you will get significantly higher perf, which may meet your expectations.

Are there docs on how to install PyTorch from here?

https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-pytorch.html

But if the upstream nightly PyTorch supports the current version of ROCm, you can use that directly.

kasper93 avatar Jun 28 '25 09:06 kasper93

Not directly related, but I just tried ComfyUI under Windows + the latest HIP + ZLUDA on my 9070xt and performance is about the same as ROCm under Linux.

Generating an SDXL image with 1024x1024 resolution and 20 steps takes ~31 seconds (same as ROCm under Linux).

No OOM errors though, which is nice.

alshdavid avatar Aug 08 '25 02:08 alshdavid

So this is also a Windows issue and not necessarily ROCm related?! I'm honestly amazed at how bad the software support is when it comes to AI. The cards were released months ago now. Clearly this is my last AMD GPU....

Loacoon1 avatar Aug 08 '25 19:08 Loacoon1

So this is also a Windows issue and not necessarily ROCm related?!

This occurs both on Windows and Linux. This thread has some interesting insights. It also looks like running an LLM using a Vulkan backend on the 9070xt shows a lot of potential so there's reason to be optimistic.

Looks like the ROCm 7.0 RC1 was cut a few hours ago so here's hoping that improves things. If I figure out how to install it (installing AI tools is madness) I'll report back.

Clearly this is my last AMD GPU....

I go back and forth on this too. I daily drive Linux and AMD GPUs have the best Linux support - otherwise I'd have gotten an nvidia 5080. It's quite frustrating that AMD's support for their own AI APIs is so poor.

The recent "AI Max" APU with up to 96 GB of RAM sounds perfect for low-cost, power-efficient local LLM inference, and perhaps even training/fine-tuning, yet it doesn't even support ROCm. It literally has AI in the name and can't run any AI workloads.

alshdavid avatar Aug 08 '25 22:08 alshdavid

Well, FSR4 is not officially supported yet... DLSS4 is... so..... But yeah, AMD performs better than NV on Linux. Still, this shouldn't happen! AMD promised much better performance with the 9000 series for AI workloads, and it's actually worse than previous gens and even makes the display server crash after LOTS of stuttering.

Loacoon1 avatar Aug 09 '25 05:08 Loacoon1

Hello,

I have zero knowledge of how this works, but I can share the workaround that worked for me.

In ComfyUI, use the following nodes: https://github.com/pollockjj/ComfyUI-MultiGPU

Completely restart the server.

In your workflow, use the comfyui-multigpu VAE loader and set the VAE device to the CPU.

I went from a 560-second VAE decode on the GPU (with the whole computer stuttering like a 2000's PC trying to play a 4K movie) to a smooth 35-40 seconds for VAE decoding on the CPU. (That's the speed for 2 pics at 1024 res.)
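For reference, stock ComfyUI appears to offer the same workaround without a custom node via a --cpu-vae flag (assuming a reasonably recent ComfyUI; check `python main.py --help` to confirm it exists in your version):

```shell
# Run VAE encode/decode on the CPU while keeping sampling on the GPU:
python main.py --cpu-vae
```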

FR-Mister-T avatar Aug 11 '25 00:08 FR-Mister-T

I have just tried installing ROCm 7 beta on Fedora 42 via the RHEL repos and retried generating a 1024x1024 image with SDXL on ComfyUI.

Relatively easy to install: https://www.youtube.com/watch?v=7qDlHpeTmC0 https://gist.github.com/B4rr3l-Rid3r/b03460860f2841144135c0fe8bede5be

....And it appears to perform identically to ROCm 6.4 on Linux. So no performance improvement and still slower than my 6900xt 🙄

alshdavid avatar Aug 14 '25 00:08 alshdavid

Thank you for sharing your detailed analysis and benchmarks! I'm experiencing almost identical performance issues with my 9070XT in WSL - also getting around 30-31 seconds for 1024x1024 SDXL generation with 20 steps, which matches your results exactly.

It's somewhat reassuring to know this appears to be a common issue rather than something specific to my setup. The consistency of our results suggests this is indeed a broader compatibility or optimization problem that will likely require AMD's attention to resolve.

I appreciate you taking the time to document and report this issue. Hopefully AMD can address these performance concerns in future ROCm updates, as the 9070XT should theoretically perform much better than what we're currently seeing.

dragonwise10 avatar Sep 10 '25 11:09 dragonwise10

I'm interested in side-by-side benchmarks for the RX 9070 XT with ROCm vs Vulkan. If anyone could submit benchmarks with the same model but different quants (f16, q8, q4, iq4), that would really be very interesting to me. I've been on the fence about buying a 9070 to replace my 6900 XT.

Matthew-Jenkins avatar Sep 10 '25 13:09 Matthew-Jenkins

I'm also interested in Vulkan benchmarks too. I've seen some LLM perform incredibly well on the 9070xt under Vulkan which gives me hope.

I've been trying to compile Pytorch to use a Vulkan backend but haven't been able to figure out their build process. The Vulkan backend is normally used for Android devices but IMO it would be helpful for getting GPU acceleration on under/unsupported devices (RDNA4, Strix Halo APUs, Intel GPUs, Snapdragon GPUs, etc).

I raised an issue on the pytorch repo asking them to release officially supported prebuilds of pytorch with a Vulkan backend https://github.com/pytorch/pytorch/issues/160230#issuecomment-3186343745

alshdavid avatar Sep 10 '25 21:09 alshdavid

IMO it would just be a placeholder anyway. 9000 series have dedicated AI capabilities that should, if software support wasn't abysmally bad, make them much faster than using Vulkan. At that point I'm just considering selling my card and going back to team green. The release was 6 months ago! It feels like a bad joke considering that part of the marketing of these cards was based on their AI capabilities, but they are still basically unusable.

Loacoon1 avatar Sep 10 '25 21:09 Loacoon1

What makes matters worse is that my experiments with ROCm 7 don't seem to improve performance on my 9070xt. So much potential in these cards to kick ass, but it doesn't look like we will see that materialize in the next 6 months.

I thought about switching to team green too, but the 5080 doesn't seem worth the money with only 16 GB of RAM. In the meantime, I've been renting an Nvidia-powered VPS for ML workloads on demand. It's like 60 cents an hour.

If the 5080 Ti/Super comes out with 32 GB then it's a no-brainer

alshdavid avatar Sep 10 '25 22:09 alshdavid

IMO it would just be a placeholder anyway. 9000 series have dedicated AI capabilities that should, if software support wasn't abysmally bad, make them much faster than using Vulkan. At that point I'm just considering selling my card and going back to team green. The release was 6 months ago! It feels like a bad joke considering that part of the marketing of these cards was based on their AI capabilities, but they are still basically unusable.

Vulkan shaders already run on the AI cores.

Matthew-Jenkins avatar Sep 11 '25 02:09 Matthew-Jenkins

I just tested official ROCm 7.0 with the ROCm fork of PyTorch.

Setup: Ubuntu 24.04, Python 3.12.11, ROCm 7.0 (installed from AMD), PyTorch 2.8.0 (installed from AMD), ComfyUI, SDXL, image size 1024x1024.

Args: python main.py
Render time: 168s
Notes: Sometimes throws error: HIP error: an illegal memory access was encountered

Full error:

got prompt
!!! Exception during processing !!! HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 315, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 289, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 277, in process_inputs
    result = f(**inputs)
  File "/home/dalsh/MachineLearning/ComfyUI/nodes.py", line 1525, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "/home/dalsh/MachineLearning/ComfyUI/nodes.py", line 1484, in common_ksampler
    noise = comfy.sample.prepare_noise(latent_image, seed, batch_inds)
  File "/home/dalsh/MachineLearning/ComfyUI/comfy/sample.py", line 13, in prepare_noise
    generator = torch.manual_seed(seed)
  File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
  File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
    return fn(*args, **kwargs)
  File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/random.py", line 46, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/cuda/random.py", line 131, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/cuda/__init__.py", line 341, in _lazy_call
    callable()
  File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/cuda/random.py", line 129, in cb
    default_generator.manual_seed(seed)
torch.AcceleratorError: HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Args: env PYTORCH_TUNABLEOP_ENABLED=1 TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python ./main.py --use-pytorch-cross-attention
Render time: 6.65s 🥳
Notes: The first render takes 300-500 seconds. This happens when you first load a model, change models, or change VAE decode settings. Would like to see this come down. The second render took 30 seconds. The third and later renders take 6 seconds.

It's now twice as fast as my old 6900xt

alshdavid avatar Sep 17 '25 03:09 alshdavid

20 steps?


githust66 avatar Sep 17 '25 04:09 githust66

20 steps, same workflow/config as the other tests. I tried to get the MIGraphX node working but had no luck, unfortunately.

90% of the time is spent in VAE decoding; I'm getting around 4-5 it/s.

alshdavid avatar Sep 17 '25 04:09 alshdavid

First render takes 500 seconds. When I change models, the first render takes 500 seconds. When I change image sizes, the first render takes 500 seconds.

Also I'm occasionally getting this error:

!!! Exception during processing !!! HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 315, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 289, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 277, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/nodes.py", line 1525, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/nodes.py", line 1492, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/sample.py", line 45, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1161, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1051, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1036, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1004, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 987, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 759, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/user/python/3.12.11/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/k_diffusion/sampling.py", line 220, in sample_euler_ancestral
    sigma_down, sigma_up = get_ancestral_step(sigmas[i], sigmas[i + 1], eta=eta)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/MachineLearning/ComfyUI/comfy/k_diffusion/sampling.py", line 70, in get_ancestral_step
    sigma_up = min(sigma_to, eta * (sigma_to ** 2 * (sigma_from ** 2 - sigma_to ** 2) / sigma_from ** 2) ** 0.5)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

alshdavid avatar Sep 19 '25 00:09 alshdavid

I'm also interested in Vulkan benchmarks too. I've seen some LLM perform incredibly well on the 9070xt under Vulkan which gives me hope.

I've been trying to compile Pytorch to use a Vulkan backend but haven't been able to figure out their build process. The Vulkan backend is normally used for Android devices but IMO it would be helpful for getting GPU acceleration on under/unsupported devices (RDNA4, Strix Halo APUs, Intel GPUs, Snapdragon GPUs, etc).

I raised an issue on the pytorch repo asking them to release officially supported prebuilds of pytorch with a Vulkan backend pytorch/pytorch#160230 (comment)

https://www.phoronix.com/review/amd-rocm-7-strix-halo

(benchmark chart from the Phoronix review linked above)

Reviews and independent tests show Vulkan beating ROCm by a wide margin, which is insane because it's generic code versus kernels optimized by the manufacturer itself. AMD needs to fix this.

It makes me wonder whether anyone is testing realistic scenarios, on products that consumers actually use.

GreenShadows avatar Sep 22 '25 19:09 GreenShadows