
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S not supported

Open marcushoff opened this issue 7 months ago • 6 comments

Expected Behavior

Image generated.

Actual Behavior

Error on image generation.

Steps to Reproduce

Run ComfyUI on a computer with an AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (gfx1151).

Debug Logs

Checkpoint files will always be loaded safely.
Total VRAM 63970 MB, total RAM 127940 MB
pytorch version: 2.8.0.dev20250518+rocm6.4
AMD arch: gfx1151
Set vram state to: NORMAL_VRAM
Device: cuda:0 Radeon 8060S Graphics : hipMallocAsync
Using sub quadratic optimization for attention, if you have memory or speed issues try using: --use-split-cross-attention
Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0]
ComfyUI version: 0.3.34
ComfyUI frontend version: 1.19.9
[Prompt Server] web root: /home/mho/miniconda3/envs/comfyui-web/lib/python3.12/site-packages/comfyui_frontend_package/static

Import times for custom nodes:
   0.0 seconds: /home/mho/ComfyUI/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load SD1ClipModel
loaded completely 62198.675 235.84423828125 True
!!! Exception during processing !!! HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/home/mho/ComfyUI/execution.py", line 349, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/execution.py", line 224, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/execution.py", line 196, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/home/mho/ComfyUI/execution.py", line 185, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/nodes.py", line 69, in encode
    return (clip.encode_from_tokens_scheduled(tokens), )
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd.py", line 166, in encode_from_tokens_scheduled
    pooled_dict = self.encode_from_tokens(tokens, return_pooled=return_pooled, return_dict=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd.py", line 228, in encode_from_tokens
    o = self.cond_stage_model.encode_token_weights(tokens)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd1_clip.py", line 682, in encode_token_weights
    out = getattr(self, self.clip).encode_token_weights(token_weight_pairs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd1_clip.py", line 45, in encode_token_weights
    o = self.encode(to_encode)
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd1_clip.py", line 288, in encode
    return self(tokens)
           ^^^^^^^^^^^^
  File "/home/mho/miniconda3/envs/comfyui-web/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/miniconda3/envs/comfyui-web/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd1_clip.py", line 250, in forward
    embeds, attention_mask, num_tokens = self.process_tokens(tokens, device)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/sd1_clip.py", line 204, in process_tokens
    tokens_embed = self.transformer.get_input_embeddings()(tokens_embed, out_dtype=torch.float32)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/miniconda3/envs/comfyui-web/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/miniconda3/envs/comfyui-web/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/ops.py", line 237, in forward
    return self.forward_comfy_cast_weights(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/ComfyUI/comfy/ops.py", line 233, in forward_comfy_cast_weights
    return torch.nn.functional.embedding(input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse).to(dtype=output_dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mho/miniconda3/envs/comfyui-web/lib/python3.12/site-packages/torch/nn/functional.py", line 2560, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.


Prompt executed in 0.64 seconds

Other

System Info

comfyui: 0.3.34 comfyui_frontend: 1.19.9

OS: posix Python Version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] Embedded Python: false Pytorch Version: 2.8.0.dev20250518+rocm6.4 Arguments: main.py RAM Total: 124.94 GB RAM Free: 108.03 GB

Devices

Name: cuda:0 Radeon 8060S Graphics : hipMallocAsync Type: cuda VRAM Total: 62.47 GB VRAM Free: 61.49 GB Torch VRAM Total: 235.84 MB Torch VRAM Free: 0 B

marcushoff avatar May 20 '25 10:05 marcushoff

As far as I know, PyTorch isn't compiled for gfx1151 (your APU) by default, so it will not work for now. You could try the packages from this repo; they include a PyTorch build specifically for your APU. I haven't tested it because I don't own that kind of hardware, but I have seen numerous discussions about it.
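One way to check this diagnosis is to compare the architecture the GPU reports with the list of architectures the installed PyTorch wheel was compiled for. Below is a minimal sketch, assuming a ROCm build of PyTorch in which `torch.cuda.get_arch_list()` returns gfx names and the device properties expose `gcnArchName`. The helper name `arch_is_supported` is made up for illustration.

```python
def arch_is_supported(device_arch: str, built_archs: list[str]) -> bool:
    """Return True if the GPU's arch is among the archs the wheel was built for."""
    # ROCm can report feature suffixes such as "gfx90a:sramecc+:xnack-";
    # only the part before the first colon is the architecture name.
    base = device_arch.split(":")[0].strip()
    return base in {a.split(":")[0].strip() for a in built_archs}


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            built = torch.cuda.get_arch_list()
            # Assumption: gcnArchName is only present on ROCm builds.
            device = getattr(torch.cuda.get_device_properties(0), "gcnArchName", "")
            print("device arch:", device)
            print("built for:  ", built)
            print("supported:  ", arch_is_supported(device, built))
        else:
            print("no GPU visible to torch")
```

If the device arch is missing from the build list, an "invalid device function" error like the one in the log is the likely outcome, although this check alone does not prove it.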

wasd-tech avatar May 20 '25 17:05 wasd-tech

You're right, it's PyTorch. Thank you, AMD, for releasing an APU with AI in its name but not building the AI libraries. I did try the repo and the PyTorch version that is supposed to work with gfx1151. It breaks ComfyUI's dependency requirements. Maybe I should try installing ComfyUI normally and then substituting PyTorch.

marcushoff avatar May 21 '25 07:05 marcushoff

Update: Using PyTorch from the repo works.

I did a normal ComfyUI install with Python 3.11, following the rocm3.4 instructions, then installed PyTorch from the repo on top of that, ignoring warnings and manually installing any missing modules. I got it running. I think some things can still be optimized. The PyTorch from the repo is 2.7; it might help to downgrade all the other torch libraries to the same version.

Checkpoint files will always be loaded safely.
Total VRAM 63970 MB, total RAM 127940 MB
pytorch version: 2.7.0a0+gitbfd8155
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.7.0a0+gitbfd8155)
    Python  3.11.3 (you have 3.11.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/home/mho/.local/lib/python3.11/site-packages/xformers/triton/softmax.py:30: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16 if _triton_softmax_fp16_enabled else None)
/home/mho/.local/lib/python3.11/site-packages/xformers/triton/softmax.py:86: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/home/mho/.local/lib/python3.11/site-packages/xformers/ops/swiglu_op.py:106: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd
/home/mho/.local/lib/python3.11/site-packages/xformers/ops/swiglu_op.py:127: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_bwd
xformers version: 0.0.20
AMD arch: gfx1151
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon Graphics : native
Using sub quadratic optimization for attention, if you have memory or speed issues try using: --use-split-cross-attention
Python version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
ComfyUI version: 0.3.34
ComfyUI frontend version: 1.19.9
[Prompt Server] web root: /home/mho/miniconda3/envs/comfyui/lib/python3.11/site-packages/comfyui_frontend_package/static

Import times for custom nodes:
   0.0 seconds: /home/mho/ComfyUI/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load SD1ClipModel
loaded completely 62594.7375 235.84423828125 True
Requested to load BaseModel
loaded completely 62067.00166015625 1639.406135559082 True
100%|███████████████████████████████████████████| 20/20 [00:44<00:00,  2.21s/it]
Requested to load AutoencoderKL
loaded completely 59060.48876953125 319.11416244506836 True
Prompt executed in 157.09 seconds
got prompt
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.52it/s]
Prompt executed in 6.29 seconds


marcushoff avatar May 21 '25 08:05 marcushoff

You're right, it's PyTorch. Thank you, AMD, for releasing an APU with AI in its name but not building the AI libraries.

I know what you mean; I have an RX 9070 XT. I think compatibility on par with CUDA will only arrive with ROCm 7.0. I'm curious about one thing with this type of APU: can you test whether the GPU has access to all of the PC's memory? It would be useful to other people with this kind of hardware. Maybe you can try running a huge batch and see if it uses all the memory.
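Before running a huge batch, a quicker first check is to ask PyTorch how much device memory it can see. This is only a sketch, assuming `torch.cuda.mem_get_info()` behaves on ROCm as it does on CUDA and returns free and total bytes. `to_gib` is a hypothetical helper name.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(num_bytes / 1024**3, 2)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            free_b, total_b = torch.cuda.mem_get_info()
            print(f"device memory: {to_gib(free_b)} GiB free of {to_gib(total_b)} GiB")
        else:
            print("no GPU visible to torch")
```

For reference, the "Total VRAM 63970 MB" line in the first log corresponds to `to_gib(63970 * 1024**2)`, which is the 62.47 shown under Devices. The number only says what the driver exposes, not whether a large allocation will actually succeed.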

wasd-tech avatar May 21 '25 10:05 wasd-tech

I tried, but gave up after running a batch of 8. It just takes forever to load. Once it's loaded, running the second batch is no problem. I assume this has something to do with the implementation not being optimized. Someone on Reddit said that the stock Fedora ROCm and PyTorch packages should work with the APU, so next I'm going to try copying those modules into ComfyUI and see if it works.

marcushoff avatar May 24 '25 08:05 marcushoff

Can you provide benchmarks of your results when you get it working? I am really interested in this PC and am thinking of getting the GMKtec EVO-X2.

CaptailTyler avatar May 25 '25 00:05 CaptailTyler

I will. As you can see above, it sits somewhere between 1.5 and 3.5 it/s with the current PyTorch build.
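When comparing runs, keep in mind that tqdm switches between s/it and it/s depending on speed, so the two progress bars in the earlier log (2.21s/it on the first prompt, 3.52it/s on the second) are not in the same unit. Here is a small sketch that normalizes both to iterations per second. `to_it_per_s` is a hypothetical helper, not part of tqdm.

```python
import re


def to_it_per_s(rate: str) -> float:
    """Convert a tqdm rate string such as '2.21s/it' or '3.52it/s' to it/s."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(it/s|s/it)\s*", rate)
    if match is None:
        raise ValueError(f"unrecognized rate: {rate!r}")
    value = float(match.group(1))
    # s/it is the reciprocal of it/s
    return value if match.group(2) == "it/s" else 1.0 / value


for rate in ("2.21s/it", "3.52it/s"):
    print(rate, "->", round(to_it_per_s(rate), 2), "it/s")
```

The first prompt after startup includes one-time warm-up cost, so the second figure is probably the more representative one.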

marcushoff avatar May 28 '25 14:05 marcushoff

@marcushoff Hi, I just downloaded https://github.com/ROCm/TheRock/pkgs/container/therock_pytorch_dev_ubuntu_24_04_gfx1151 and started the Docker container. How do I get the PyTorch build and install ComfyUI?

Thank you

icarus0508 avatar Jun 04 '25 20:06 icarus0508

You need to build ComfyUI in a Python 3.11 environment. At the step where you would install torch, torchaudio and torchvision, install the downloaded versions instead. Then continue as normal.

marcushoff avatar Jun 07 '25 09:06 marcushoff

@marcushoff Thanks, it works, but only if I install NumPy 1.26.4 instead.

That really helped a lot.

icarus0508 avatar Jun 10 '25 09:06 icarus0508

@marcushoff I also want to thank you for your experiments with the new hardware. I made a guide for ROCm with the list of commands to install all the major programs, and your results with the AMD RYZEN AI MAX+ 395 are priceless. I will surely add them to the guide.

wasd-tech avatar Jun 10 '25 12:06 wasd-tech

No need to thank me. I would like to see your guide when it's done. BTW, there's a new PyTorch build out for Python 3.12 now.

marcushoff avatar Jun 11 '25 09:06 marcushoff

@marcushoff I already wrote the guide; you can find it here. Unfortunately it is still in Italian, but I am translating it into English. I've divided it by topic so it's easy to search for what you want to install. The reason it's in Italian is that it started as a commented list of commands that I used to install programs for AMD, and I later turned it into a full guide.

wasd-tech avatar Jun 11 '25 12:06 wasd-tech

I really appreciate this thread, so I will write down all the steps I took to make it work:

conda create -n comfyenv python=3.11
conda activate comfyenv

git clone git@github.com:comfyanonymous/ComfyUI.git
cd ComfyUI

# check python version 
# python 3.11.9

python -m pip install --upgrade pip
pip install -r requirements.txt

# download the 3 torch files from here: https://github.com/scottt/rocm-TheRock/releases/tag/v6.5.0rc-pytorch-gfx110x
pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl
pip install torchaudio-2.7.0a0+52638ef-cp311-cp311-win_amd64.whl
pip install torchvision-0.22.0+9eb57cd-cp311-cp311-win_amd64.whl

python main.py
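
The wheels above are tagged cp311, so pip will only accept them under CPython 3.11. Here is a small sketch for checking a wheel filename against the running interpreter before installing. `wheel_matches_interpreter` is a hypothetical helper, and it only handles plain CPython tags like the ones in these filenames.

```python
import sys


def wheel_matches_interpreter(wheel_name, version_info=sys.version_info):
    """Return True if the wheel's Python tag matches the given CPython version."""
    # Wheel filenames end in {python tag}-{abi tag}-{platform tag}.whl
    stem = wheel_name[:-len(".whl")] if wheel_name.endswith(".whl") else wheel_name
    parts = stem.split("-")
    if len(parts) < 5:
        raise ValueError(f"not a wheel filename: {wheel_name!r}")
    python_tag = parts[-3]
    return python_tag == f"cp{version_info[0]}{version_info[1]}"


wheels = [
    "torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl",
    "torchaudio-2.7.0a0+52638ef-cp311-cp311-win_amd64.whl",
    "torchvision-0.22.0+9eb57cd-cp311-cp311-win_amd64.whl",
]
for name in wheels:
    print(name, "matches this interpreter:", wheel_matches_interpreter(name))
```

Note that these filenames also carry the win_amd64 platform tag, so they are Windows builds; the sketch does not check the platform.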

0xIslamTaha avatar Jul 02 '25 14:07 0xIslamTaha

Dockerfile that works for me:

FROM ghcr.io/rocm/therock_pytorch_dev_ubuntu_24_04_gfx1151:main

ENV DEBIAN_FRONTEND=noninteractive

RUN apt update && \
    apt -y install git
    
RUN git clone https://github.com/comfyanonymous/ComfyUI.git /app

WORKDIR /app

COPY requirements.txt /app/requirements.txt
RUN python3 -m venv --system-site-packages /venv && \
    /venv/bin/pip install -r requirements.txt

ENV LD_LIBRARY_PATH=/opt/rocm/lib/llvm/lib:

CMD ["/venv/bin/python3", "main.py", "--listen", "--port", "8188", "--base-directory", "/data/"]
EXPOSE 8188

(where by "works" I mean "comfyui starts up and workflows run fine up to the VAE, where they crash with a page fault in dmesg, as documented in https://github.com/ROCm/TheRock/issues/986")

Edited to add: initially I had to supply requirements.txt because I was using the rocm/pytorch image as a base, and it had no torchaudio, and simply installing torchaudio with pip pulled in a whole different torch that's NVIDIA-specific and wouldn't work. Afterwards I discovered that the ComfyUI manager also has some dependencies it needs, so I've added those to the requirements even though I keep custom_nodes in a volume. I hate the manager because it's highly incompatible with containerization, but it offers a convenient way to download a number of models I wouldn't know where to find otherwise.

m0n5t3r avatar Jul 11 '25 11:07 m0n5t3r

"comfyui starts up and workflows run fine up to the VAE, where they crash with a page fault in dmesg, as documented https://github.com/ROCm/TheRock/issues/986"

@m0n5t3r You can try running the tiled version of VAE Decode and also setting its precision to bf16 by adding --bf16-vae at the end of the command. It is a known issue, by the way; I also have it with my RX 9070 XT.

wasd-tech avatar Jul 11 '25 16:07 wasd-tech

That doesn't help in this case, but it works almost fine with --cpu-vae (the CPU is fast enough that there's not much difference; maybe it's even faster, at least for what I've tried). There are occasional random crashes in CLIP, and reliable crashes in the upscaler-with-model, just like the VAE before (I tried 2 upscaler models, but I don't really know what I'm doing).

m0n5t3r avatar Jul 11 '25 18:07 m0n5t3r

I encountered the same problem, whether with ComfyUI, InvokeAI, or even with plain source code:

from diffusers import StableDiffusionPipeline
import torch

model_id = "sd-legacy/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")

They all crash at roughly the VAE stage with an "Aborted (core dumped)" error.

KoDelioDa avatar Jul 19 '25 17:07 KoDelioDa

@m0n5t3r @marcushoff @KoDelioDa and everyone else with a gfx1151: AMD just released ROCm 6.4.2, and the release notes say:

hipBLASLt (0.12.1) Added Support for gfx1151 on Linux, complementing the previous support in the HIP SDK for Windows.

https://rocm.docs.amd.com/en/latest/about/release-notes.html If you get errors with the torch build downloaded from the official PyTorch site, I suggest trying the one from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.2/ instead. Unfortunately, this suggestion only applies to Linux.

I hope it solves some of the problems

wasd-tech avatar Jul 22 '25 16:07 wasd-tech

Thanks for the info, updating the host amdgpu driver (and ROCm, but that's more or less irrelevant for the containers I think) seems to have made some things much faster (LLMs mainly, but maybe image models too)

The issues with VAE and upscalers persist though; TheRock folks build / release a pre-release version of ROCm anyway (7.0.0 right now) and torch 2.7.0:

Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rocm_sdk
>>> rocm_sdk.__version__
'7.0.0rc20250715'
>>> import torch
>>> torch.__version__
'2.7.0a0+rocm7.0.0rc20250715'

I don't know what the VAE and upscalers do differently from the image models (or why the FLUX CLIP crashes a few times at first and then works flawlessly), but I suspect there's not much Comfy can do about it, since the crash happens somewhere in torch code, or even in the drivers: amdgpu complains about page faults in dmesg.
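For anyone who wants to attach the relevant kernel messages to an upstream report, here is a small sketch that filters amdgpu page-fault lines out of dmesg output. `find_page_faults` is a hypothetical helper, and the sample line is made up for illustration; the exact wording of real amdgpu messages may differ, so the pattern is kept loose.

```python
import re

# Loose pattern: any line that mentions amdgpu followed by "page fault".
PAGE_FAULT = re.compile(r"amdgpu.*page fault", re.IGNORECASE)


def find_page_faults(log_text: str) -> list[str]:
    """Return the lines of a kernel log that look like amdgpu page faults."""
    return [line for line in log_text.splitlines() if PAGE_FAULT.search(line)]


# Illustrative sample only, not copied from this thread.
sample = "\n".join([
    "[  100.000000] usb 1-1: new high-speed USB device",
    "[  123.456789] amdgpu 0000:c5:00.0: amdgpu: [gfxhub] page fault (example)",
])
for line in find_page_faults(sample):
    print(line)
```

On a real system the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may require root depending on the distribution's settings.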

m0n5t3r avatar Jul 23 '25 13:07 m0n5t3r

@m0n5t3r @wasd-tech


Thank you for your suggestion. I was very excited when I received this message. However, after installing ROCm 6.4.2 and trying to install PyTorch both from the official website and from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.2/, I still encountered errors. This time, the issue seems more complex and is likely caused by changes in the latest ROCm version.

  1. When I used the official PyTorch build and ran ComfyUI, it showed that no CUDA device was found.

  2. When I used the PyTorch build from https://repo.radeon.com/rocm/manylinux/rocm-rel-6.4.2/, I received the following error:

    File "/home/node1/ComfyUI/comfy/ops.py", line 237, in forward
        return self.forward_comfy_cast_weights(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/node1/ComfyUI/comfy/ops.py", line 233, in forward_comfy_cast_weights
        return torch.nn.functional.embedding(input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse).to(dtype=output_dtype)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/node1/miniconda3/envs/cf2/lib/python3.12/site-packages/torch/nn/functional.py", line 2551, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: HIP error: invalid device function
    HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing AMD_SERIALIZE_KERNEL=3
    Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
    
  3. Following the error in point 2, I manually set environment variables based on [this script](https://github.com/lhl/strix-halo-testing/blob/main/rocm-env.sh) in an attempt to fix the issue. After doing so, the following error occurred:

    got prompt
    model weight dtype torch.float16, manual cast: None
    model_type EPS
    Segmentation fault (core dumped)
    

KoDelioDa avatar Jul 23 '25 14:07 KoDelioDa

For your point 1: PyTorch isn't distributing official builds for ROCm 6.4 yet, so you'd have to install them from nightly (https://pytorch.org/get-started/locally/ - select preview, linux, pip, python and it will give you the command to install torch; I didn't have much luck with it).

m0n5t3r avatar Jul 23 '25 14:07 m0n5t3r

@KoDelioDa I'm sorry if I gave you false expectations :cry: I don't have that hardware, so I can only point to sources.

@m0n5t3r I also think there are more general problems with ROCm/Torch than with specific programs. I encountered very similar problems with all the AI programs that I tried (ComfyUI, WanGP, fluxgym, kohya_ss). @KoDelioDa pointed out that even transformer itself has problems. And the most important thing is that there are identical bugs on different GPUs (I have an rx 9070 xt).

If someone wants to open an issue on the ROCm/torch side (for example, regarding the VAE bug), I will surely join the conversation.

wasd-tech avatar Jul 23 '25 14:07 wasd-tech


When issue 3 occurred, I went back to issue 2 and set environment variables to spoof the system, as shown in 4:

export HSA_OVERRIDE_GFX_VERSION=11.0.0
export AMD_SERIALIZE_KERNEL=3

Then I launched ComfyUI using:

python main.py --listen 0.0.0.0

However, when I clicked "Run" to start generating images, an error still occurred, as shown in 5:

To see the GUI go to: http://0.0.0.0:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
loaded diffusion model directly to GPU
Requested to load BaseModel
loaded completely 9.5367431640625e+25 1639.406135559082 True
Requested to load SD1ClipModel
loaded completely 95122.82880859375 235.84423828125 True
0%| | 0/20 [00:00<?, ?it/s]:0:rocdevice.cpp :2994: 3461243762 us: Callback: Queue 0x7501ba000000 aborting with error : HSA_STATUS_ERROR_INVALID_ISA: The instruction set architecture is invalid. code: 0x100f
Aborted (core dumped)

After the error in 5, I went back again to 4, kept the environment variables, and started ComfyUI with the following command in 6:

python main.py --listen 0.0.0.0 --force-fp32 --fp32-unet --fp32-vae --fp32-text-enc

This time it worked:

got prompt
model weight dtype torch.float32, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
Requested to load SD1ClipModel
loaded completely 9.5367431640625e+25 471.6884765625 True
CLIP/text encoder model load device: cpu, offload device: cpu, current: cpu, dtype: torch.float32
loaded diffusion model directly to GPU
Requested to load BaseModel
loaded completely 9.5367431640625e+25 3278.812271118164 True
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  2.99it/s]
Requested to load AutoencoderKL
loaded completely 91885.9658203125 319.11416244506836 True
Prompt executed in 8.12 seconds
got prompt
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:04<00:00,  3.23s/it]
Prompt executed in 65.80 seconds
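
The working combination above (the two environment variables plus the fp32 flags) can be kept in a small launcher so it is not retyped each time. This is a sketch only: `build_launch` is a hypothetical helper, the values are the ones quoted in this comment, and whether spoofing the GFX version is safe on other setups is untested.

```python
import os
import sys


def build_launch():
    """Return (command, environment) for the fp32 workaround described above."""
    env = dict(os.environ)
    # Environment variables exactly as set in step 4.
    env["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
    env["AMD_SERIALIZE_KERNEL"] = "3"
    # ComfyUI flags exactly as used in step 6.
    cmd = [
        sys.executable, "main.py",
        "--listen", "0.0.0.0",
        "--force-fp32", "--fp32-unet", "--fp32-vae", "--fp32-text-enc",
    ]
    return cmd, env


cmd, env = build_launch()
print(" ".join(cmd[1:]))
```

To actually start the server, pass the result to `subprocess.run(cmd, env=env, cwd=<path to ComfyUI>, check=True)`, where the path is wherever ComfyUI was cloned.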

A. I haven't tested images with different precisions.
B. I haven't tested it on Flux, and maybe I won't; my current project only needs SDXL.
C. ROCm 6.4.2 doesn't seem to be necessary for the environment-variable approach; I have seen other (LLM) projects run using the ROCm build for gfx1100.

I have tested the 8060S in different projects (image generation, LLM, etc.). I think this is an incompatibility between the underlying ROCm and PyTorch, and it may cause many different problems that make the ComfyUI experience poor. Anyway, I can run SDXL now and my project is complete; it's time to use CUDA, lol.

KoDelioDa avatar Jul 23 '25 14:07 KoDelioDa

Mine works fine. Arch Linux + Linux 6.15.7-arch1-1

sudo pacman -S python-pytorch-opt-rocm
yay -S python-torchvision-rocm
python -m pip install -r requirements.txt --break-system-packages  # comfyui packages only

I didn't install torchaudio since I don't need it.

-> % python main.py --listen 0.0.0.0 --fast fp16_accumulation
Checkpoint files will always be loaded safely.
Total VRAM 65536 MB, total RAM 63933 MB
pytorch version: 2.7.1
AMD arch: gfx1151
ROCm version: (6, 4)
Set vram state to: NORMAL_VRAM
Device: cuda:0 Radeon 8060S Graphics : native
Using pytorch attention
torchaudio missing, ACE model will be broken
torchaudio missing, ACE model will be broken
Python version: 3.13.5 (main, Jun 21 2025, 09:35:00) [GCC 15.1.1 20250425]
ComfyUI version: 0.3.44
ComfyUI frontend version: 1.23.4
[Prompt Server] web root: /home/<hide>/.local/lib/python3.13/site-packages/comfyui_frontend_package/static

Import times for custom nodes:
   0.0 seconds: /home/<hide>/workspace/ComfyUI/custom_nodes/websocket_image_save.py

Context impl SQLiteImpl.
Will assume non-transactional DDL.
No target revision found.
Starting server

To see the GUI go to: http://0.0.0.0:8188
^C
Stopped server

fanyang89 avatar Jul 26 '25 11:07 fanyang89

Yeah, but does it actually generate images? For those who see the issue, it starts up fine and then crashes on VAE or upscaling.

m0n5t3r avatar Jul 26 '25 16:07 m0n5t3r

Got ComfyUI working on the AI MAX+ 395; install triton to avoid the VAE error.

  • Install docker
  • Clone ComfyUI
apt update
apt install git -y
git clone https://github.com/comfyanonymous/ComfyUI
  • Launch a container from the official ROCm image
 docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video \
-v `pwd`/ComfyUI:/opt/ComfyUI -p 8188:8188 \
--ipc=host --shm-size 8G ghcr.io/rocm/therock_pytorch_dev_ubuntu_24_04_gfx1151:main
  • Fix OpenMP library
echo '/opt/rocm/lib/llvm/lib' >> /etc/ld.so.conf.d/rocm.conf
rm /etc/ld.so.cache
ldconfig
  • Install packages
cd /opt/ComfyUI
pip install -r requirements.txt  --break-system-packages
pip install triton==3.2.0 # fix VAE crash on GPU
  • Launch ComfyUI
PYTORCH_TUNABLEOP_ENABLED=1 MIOPEN_FIND_MODE=FAST ROCBLAS_USE_HIPBLASLT=1 python3 main.py --listen 0.0.0.0
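
Since the fix here hinges on a specific triton version, it may help to verify the pins inside the container before launching. This is a sketch using the standard library's `importlib.metadata`; `check_pins` is a hypothetical helper. The triton pin comes from the steps above, while the numpy pin was reported earlier in this thread for a different install and is included only as an example.

```python
from importlib import metadata

# triton pin from the steps above; numpy pin from an earlier comment (different setup).
PINS = {"triton": "3.2.0", "numpy": "1.26.4"}


def check_pins(pins, get_version=metadata.version):
    """Map each package to (installed version or None, whether it matches the pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (found, found == wanted)
    return report


for name, (found, ok) in check_pins(PINS).items():
    print(f"{name}: installed={found} pinned={PINS[name]} ok={ok}")
```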

pccr10001 avatar Aug 23 '25 00:08 pccr10001

Just curious if anyone has gotten this to work?

treviloan avatar Aug 24 '25 17:08 treviloan

Chiming in because I have the same hardware and seem to run into same issue(s) mentioned here.

vchrizz avatar Aug 24 '25 21:08 vchrizz

Same AMD 395 hardware and issues, am continuing to follow progress ......

kenny8zeng avatar Aug 25 '25 00:08 kenny8zeng