[Issue]: Performance of 9070xt with ComfyUI
Problem Description
I'm using the default ComfyUI SDXL workflow on Ubuntu 24.04 with the proprietary AMD drivers. The same happens on Fedora 42.
Issues:
- Using the default launch settings, the generation crashes with OOM errors
- Using tweaked settings, generation passes but is very slow
Results:
Clean Ubuntu 24.04 installation, AMD proprietary drivers, Python 3.10.17, PyTorch nightly, ROCm 6.4.1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"
python ./main.py
| Card | Model | Steps | Resolution | Time | Notes |
|---|---|---|---|---|---|
| 6900xt | SD1.5 | 20 | 512x512 | 2.42s | ROCm 6.3 |
| 9070xt | SD1.5 | 20 | 512x512 | 3.76s | |
| 6900xt | SDXL | 20 | 1024x1024 | 15.16s | ROCm 6.3 |
| 9070xt | SDXL | 20 | 1024x1024 | FAIL | Crashed with out of memory |
| 9070xt | SDXL | 20 | 1024x1024 | 30.51s | Used tiled VAE decoder to avoid OOM failure |
Results 2:
Clean Ubuntu 24.04 installation, AMD proprietary drivers, Python 3.10.17, PyTorch nightly, ROCm 6.4.1
export PYTORCH_TUNABLEOP_ENABLED=1
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"
python ./main.py --use-pytorch-cross-attention
| Model | Steps | Resolution | Speed | Time | Notes |
|---|---|---|---|---|---|
| SDXL | 20 | 1024x1024 | 1.49it/s | 34.56s | |
| SDXL | 20 | 1024x1024 | 1.5it/s | 27.76s | Manual tiled VAE decoder |
Operating System
Ubuntu 24.04
CPU
AMD 7950x
GPU
Radeon RX 9070xt
ROCm Version
6.4.1
ROCm Component
No response
Steps to Reproduce
- Install Ubuntu 24.04
- Install AMD proprietary drivers with ROCm
- Install Python 3.10.17
- Clone ComfyUI
- Start ComfyUI
- Use default workflow for SDXL
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Hi @alshdavid. Internal ticket has been created to investigate this issue. Thanks!
Can confirm this is happening to me as well on a workflow that used to work on a 6700.
Also related: https://github.com/comfyanonymous/ComfyUI/issues/7332
Prepend MIOPEN_FIND_MODE=2 to your comfyui command.
@Matthew-Jenkins, it appears to make no difference, unfortunately:
With (baseline):
export PYTORCH_TUNABLEOP_ENABLED=1
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"
python3.10 ./main.py --use-pytorch-cross-attention
| Model | Steps | Resolution | Speed | Time | Notes |
|---|---|---|---|---|---|
| SDXL | 20 | 1024x1024 | 1.49it/s | 34.56s | |
| SDXL | 20 | 1024x1024 | 1.5it/s | 27.76s | Manual tiled VAE decoder |
With:
export MIOPEN_FIND_MODE=2
export PYTORCH_TUNABLEOP_ENABLED=1
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export TORCH_COMMAND="--pre torch torchvision torchaudio pytorch-triton-rocm --index-url https://download.pytorch.org/whl/nightly/rocm6.4"
python3.10 ./main.py --use-pytorch-cross-attention
| Model | Steps | Resolution | Speed | Time | Notes |
|---|---|---|---|---|---|
| SDXL | 20 | 1024x1024 | 1.46it/s | 34.74s | |
| SDXL | 20 | 1024x1024 | 1.48it/s | 27.88s | Manual tiled VAE decoder |
I also tried to use the MIGraphX node however it either doesn't work with SDXL or there is an issue with my configuration. Raised an issue on their repo: https://github.com/pnikolic-amd/ComfyUI_MIGraphX/issues/5
9070 support wasn't added until 6.4.1. You'll have problems until the PyTorch nightly updates to use it. The HSA override can make it run as if it were a previous generation card.
The next thing to try is HSA_OVERRIDE_GFX_VERSION. Try MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=11.0.0
if that doesn't work then try
MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=10.3.0
Until PyTorch updates to 6.4.1 you will not get any benefit from the AI cores. But once it does you can expect about 1.2 TFLOPs for i4 ops.
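Spelled out against the launch command used earlier in this thread, the two suggestions above would look like this (a sketch; whether either override value helps, if at all, depends on the setup):

```sh
# Try the RDNA3 override first, then the RDNA2 one if that fails.
MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=11.0.0 python ./main.py --use-pytorch-cross-attention
# MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=10.3.0 python ./main.py --use-pytorch-cross-attention
```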
Thanks for the tips
Try MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=11.0.0, if that doesn't work then try MIOPEN_FIND_MODE=2 HSA_OVERRIDE_GFX_VERSION=10.3.0
Just tried both of these, unfortunately no luck there.
Until pytorch updates to 6.4.1 you will not get any benefit from the ai cores.
Ah, well that's good news. Will keep an eye out on pytorch's progress
@alshdavid you need this for comfyui to work correctly https://github.com/comfyanonymous/ComfyUI/pull/8289
I tried your commit with SDXL and VAE Decoder (Tiled) still performs much worse than it did on my old GPU (6700). Is that the same on your end?
Flag used: --use-pytorch-cross-attention
Edit: Never mind, I had to turn off the experimental feature. But somehow the output image is gibberish with weird circular patterns.
Edit 2: Using the newest torch build from PyTorch instead of AMD fixed the weird circular patterns issue.
For convolution workloads, Winograd solvers in MIOpen would help a lot; I asked about those here: https://github.com/ROCm/MIOpen/issues/3750. Another issue is that the default GEMM tuning for gfx1201 seems not to be optimal for some workloads. And there are probably a few other areas that could improve.
With the ComfyUI changes I mentioned, TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1, and a PyTorch install from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/ it should work "reasonably", meaning there is still a lot of performance left on the table, but it's not completely broken anymore...
Using https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/how-to-use-hipblaslt-offline-tuning.html can increase perf 2x in some cases.
I'm having the same problem using pytorch-rocm with Chainner. With my 6800XT, an inference took 17 seconds, while the same takes 28 seconds with the 9070XT.
Hey @kasper93, regarding this:
and pytroch install from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/ it should work "reasonable"
Are there docs on how to install PyTorch from here?
I have no idea what pip install is doing or how Python resolves its dependencies so I don't know how to pick the archives and manually reproduce the pip installation. I've been using the pip3 install command from pytorch.org, which I assume installs the packages into some global packages folder in the venv folder. Do you just download & extract the archives there?
I tried the following but that doesn't work:
pip3 install https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.1/pytorch_triton_rocm-3.2.0%2Brocm6.4.1.git6da9e660-cp39-cp39-linux_x86_64.whl
Also, how do I know that ComfyUI is using those packages rather than the previously installed ones from pytorch.org?
Apparently, PyTorch just added support for ROCm 6.4.1 but either I am installing it wrong or it doesn't appear to make any difference.
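One way to answer the "which torch is ComfyUI actually using" question is to ask the same interpreter/venv that launches main.py (a minimal sketch; torch.version.hip is only populated on ROCm builds):

```sh
# Run from the venv that starts ComfyUI; prints the torch version, the ROCm/HIP
# version it was built against, and the directory it was imported from.
python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.__file__)"
```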
@alshdavid: Keep expectations low, it's still slow in general. It works, but it seems you have overenthusiastic expectations for it.
See my comment https://github.com/ROCm/ROCm/issues/4846#issuecomment-2952494361
Also, if your workload is only SD, AMD recently updated the MIGraphX node for ComfyUI: https://github.com/pnikolic-amd/ComfyUI_MIGraphX. After compiling the model with it, you will get significantly higher perf, which may meet your expectations.
Are there docs on how to install PyTorch from here?
https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-pytorch.html
But if upstream nightly PyTorch supports the current version of ROCm, you can use that directly.
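For reference, the upstream nightly route boils down to installing into the ComfyUI venv with the same index URL used earlier in this thread (a sketch, not AMD's documented flow):

```sh
# Inside the venv that runs ComfyUI; pulls the nightly ROCm 6.4 wheels from pytorch.org.
pip3 install --pre torch torchvision torchaudio pytorch-triton-rocm \
  --index-url https://download.pytorch.org/whl/nightly/rocm6.4
```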
Not directly related but I just tried ComfyUI under Windows + latest HIP + Zluda on my 9070xt and performance is about the same as ROCm under Linux.
Generating an SDXL image with 1024x1024 resolution and 20 steps takes ~31 seconds (same as ROCm under Linux).
No OOM errors though, which is nice.
So this is also a Windows issue and not necessarily Rocm related?! I'm honestly amazed at how bad software support is when it comes to AI. The cards were released months ago now. Clearly this is my last AMD GPU....
So this is also a Windows issue and not necessarily Rocm related?!
This occurs both on Windows and Linux. This thread has some interesting insights. It also looks like running an LLM using a Vulkan backend on the 9070xt shows a lot of potential so there's reason to be optimistic.
Looks like the ROCm 7.0 RC1 was cut a few hours ago so here's hoping that improves things. If I figure out how to install it (installing AI tools is madness) I'll report back.
Clearly this is my last AMD GPU....
I go back and forth on this too. I daily drive Linux and AMD GPUs have the best Linux support - otherwise I'd have gotten an nvidia 5080. It's quite frustrating that AMD's support for their own AI APIs is so poor.
The recent "AI Max" APU with up to 96gb of ram sounds perfect for low cost/power efficient local LLM inference and perhaps even training/fine-tuning however it doesn't even support ROCm. It literally has AI in the name and can't run any AI workloads.
Well, FSR4 is not officially supported yet... DLSS4 is... so..... But yeah, AMD performs better than NV on Linux. Still, this shouldn't happen! AMD promised much better performance with the 9000 series for AI workloads, and it's actually worse than previous gens and even makes the display server crash after LOTS of stuttering.
Hello,
I have zero knowledge of how this works, but I can share the workaround that worked for me.
In ComfyUI, use the following nodes: https://github.com/pollockjj/ComfyUI-MultiGPU
Completely restart the server
In your workflow, use the comfyui-multigpu VAE loader and set the VAE device to the CPU...
I went from 560 seconds of VAE decode on the GPU (with the whole computer stuttering like a 2000's PC trying to play a 4K movie) to a smooth 35-40 seconds for VAE decoding on the CPU (that's the speed for 2 pics at 1024 resolution).
I have just tried installing ROCm 7 beta on Fedora 42 via the RHEL repos and retried generating a 1024x1024 image with SDXL on ComfyUI.
Relatively easy to install: https://www.youtube.com/watch?v=7qDlHpeTmC0 https://gist.github.com/B4rr3l-Rid3r/b03460860f2841144135c0fe8bede5be
....And it appears to perform identically to ROCm 6.4 on Linux. So no performance improvement and still slower than my 6900xt 🙄
Thank you for sharing your detailed analysis and benchmarks! I'm experiencing almost identical performance issues with my 9070XT in WSL - also getting around 30-31 seconds for 1024x1024 SDXL generation with 20 steps, which matches your results exactly.
It's somewhat reassuring to know this appears to be a common issue rather than something specific to my setup. The consistency of our results suggests this is indeed a broader compatibility or optimization problem that will likely require AMD's attention to resolve.
I appreciate you taking the time to document and report this issue. Hopefully AMD can address these performance concerns in future ROCm updates, as the 9070XT should theoretically perform much better than what we're currently seeing.
I'm interested in side-by-side benchmarks for the RX 9070 XT with ROCm vs Vulkan. If anyone could submit benchmarks with the same model but different quants (f16, q8, q4, iq4), that would be very interesting to me. I've been on the fence about buying a 9070 to replace my 6900 XT.
I'm interested in Vulkan benchmarks too. I've seen some LLMs perform incredibly well on the 9070xt under Vulkan, which gives me hope.
I've been trying to compile Pytorch to use a Vulkan backend but haven't been able to figure out their build process. The Vulkan backend is normally used for Android devices but IMO it would be helpful for getting GPU acceleration on under/unsupported devices (RDNA4, Strix Halo APUs, Intel GPUs, Snapdragon GPUs, etc).
I raised an issue on the pytorch repo asking them to release officially supported prebuilds of pytorch with a Vulkan backend https://github.com/pytorch/pytorch/issues/160230#issuecomment-3186343745
IMO it would just be a placeholder anyway. 9000 series have dedicated AI capabilities that should, if software support wasn't abysmally bad, make them much faster than using Vulkan. At that point I'm just considering selling my card and going back to team green. The release was 6 months ago! It feels like a bad joke considering that part of the marketing of these cards was based on their AI capabilities, but they are still basically unusable.
What makes matters worse is that my experiments with ROCm 7 don't seem to improve performance on my 9070xt. So much potential in these cards to kick ass, but it doesn't look like we will see that materialize in the next 6 months.
I thought about switching to team green too, but the 5080 doesn't seem worth the money with only 16 GB of VRAM. In the meantime, I've been renting an Nvidia-powered VPS for ML workloads on-demand. It's like 60 cents an hour.
If the 5080 Ti/Super comes out with 32 GB then it's a no-brainer.
IMO it would just be a placeholder anyway. 9000 series have dedicated AI capabilities that should, if software support wasn't abysmally bad, make them much faster than using Vulkan. At that point I'm just considering selling my card and going back to team green. The release was 6 months ago! It feels like a bad joke considering that part of the marketing of these cards was based on their AI capabilities, but they are still basically unusable.
Vulkan shaders already run on the AI cores.
I just tested official ROCm 7.0 with the ROCm fork of Pytorch
- Ubuntu 24.04
- Python 3.12.11
- ROCm 7.0 (Installed from AMD)
- Pytorch 2.8.0 (Installed from AMD)
- ComfyUI
- SDXL
- Image Size 1024 x 1024
Args: python main.py
Render time: 168s
Notes: Sometimes throws error: HIP error: an illegal memory access was encountered
Full Error
got prompt
!!! Exception during processing !!! HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "/home/dalsh/MachineLearning/ComfyUI/execution.py", line 277, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "/home/dalsh/MachineLearning/ComfyUI/nodes.py", line 1525, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/ComfyUI/nodes.py", line 1484, in common_ksampler
noise = comfy.sample.prepare_noise(latent_image, seed, batch_inds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/ComfyUI/comfy/sample.py", line 13, in prepare_noise
generator = torch.manual_seed(seed)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/random.py", line 46, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/cuda/random.py", line 131, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/cuda/init.py", line 341, in _lazy_call
callable()
File "/home/dalsh/MachineLearning/.local/python/3.12.11/lib/python3.12/site-packages/torch/cuda/random.py", line 129, in cb
default_generator.manual_seed(seed)
torch.AcceleratorError: HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
Args: env PYTORCH_TUNABLEOP_ENABLED=1 TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 python ./main.py --use-pytorch-cross-attention
Render time: 6.65s 🥳
Notes: First render takes 300-500 seconds. This happens when you first load a model, change models or change VAE decode settings. Would like to see this come down. The second render took 30 seconds. The third+ render takes 6 seconds
It's now twice as fast as my old 6900xt
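Since PYTORCH_TUNABLEOP_ENABLED=1 is set in that run, part of the long first render is presumably the tuning pass itself. If I understand TunableOp correctly, the tuning results can be written to a file and reused across launches so later sessions skip retuning; a sketch (env var names are from PyTorch's TunableOp docs, verify against your torch version):

```sh
# First launch: tune and record results (slow first renders).
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 \
  PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv \
  python ./main.py --use-pytorch-cross-attention

# Later launches: reuse the recorded results without re-tuning.
PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 \
  PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv \
  python ./main.py --use-pytorch-cross-attention
```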
20 steps?
20 steps, same workflow/config as the other tests. I tried to get the MiGraphX node working but had no luck unfortunately.
90% of the time is spent in VAE decoding, getting around 4-5 it/sec
First render takes 500 seconds. When I change models first render takes 500 seconds. When I change image sizes the first render takes 500 seconds.
Also I'm occasionally getting this error:
!!! Exception during processing !!! HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "/mnt/data/MachineLearning/ComfyUI/execution.py", line 277, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/nodes.py", line 1525, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/nodes.py", line 1492, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/sample.py", line 45, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1161, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1051, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1036, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 1004, in outer_sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 987, in inner_sample
samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/samplers.py", line 759, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/user/python/3.12.11/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/k_diffusion/sampling.py", line 220, in sample_euler_ancestral
sigma_down, sigma_up = get_ancestral_step(sigmas[i], sigmas[i + 1], eta=eta)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/MachineLearning/ComfyUI/comfy/k_diffusion/sampling.py", line 70, in get_ancestral_step
sigma_up = min(sigma_to, eta * (sigma_to ** 2 * (sigma_from ** 2 - sigma_to ** 2) / sigma_from ** 2) ** 0.5)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: HIP error: an illegal memory access was encountered
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
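If the illegal memory access keeps recurring, the error text itself points at a way to get a more trustworthy stack trace; a minimal sketch of that re-run (debugging only, it will be slower):

```sh
# Serialize HIP kernel launches so the faulting kernel is reported at the call
# site instead of asynchronously, as the error message above recommends.
AMD_SERIALIZE_KERNEL=3 PYTORCH_TUNABLEOP_ENABLED=1 TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  python ./main.py --use-pytorch-cross-attention
```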
I'm also interested in Vulkan benchmarks too. I've seen some LLM perform incredibly well on the 9070xt under Vulkan which gives me hope.
I've been trying to compile Pytorch to use a Vulkan backend but haven't been able to figure out their build process. The Vulkan backend is normally used for Android devices but IMO it would be helpful for getting GPU acceleration on under/unsupported devices (RDNA4, Strix Halo APUs, Intel GPUs, Snapdragon GPUs, etc).
I raised an issue on the pytorch repo asking them to release officially supported prebuilds of pytorch with a Vulkan backend pytorch/pytorch#160230 (comment)
https://www.phoronix.com/review/amd-rocm-7-strix-halo
Reviews and independent tests show Vulkan beating ROCm by a wide margin, which is insane because it's generic code versus kernels optimized by the manufacturer itself. AMD needs to fix this.
It makes me wonder whether anyone is testing realistic scenarios on products that consumers actually use.