2 comments by Repeerc

I have implemented a minimal version of FlashAttention v2 on AMD GPUs: https://github.com/Repeerc/flash-attention-v2-RDNA3-minimal. Changing rocWMMA to CUDA WMMA should make it work on NVIDIA tensor cores.
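The rocWMMA API maps almost one-to-one onto `nvcuda::wmma`, so the port is mostly mechanical. A minimal sketch of what one tile of the Q·Kᵀ GEMM could look like after the swap (the kernel name, pointer parameters, and leading dimensions are hypothetical; 16×16×16 half/float tiles are supported by both libraries):

```cuda
// Sketch of the rocWMMA -> nvcuda::wmma mapping (NVIDIA sm_70+).
// Each commented rocwmma:: call is the RDNA3 counterpart.
#include <mma.h>
using namespace nvcuda;

__global__ void qk_tile_gemm(const half *Q, const half *K, float *S,
                             int ldq, int ldk, int lds) {
    // rocwmma::fragment<rocwmma::matrix_a, ...> -> wmma::fragment<wmma::matrix_a, ...>
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> q_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> k_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> s_frag;

    wmma::fill_fragment(s_frag, 0.0f);               // rocwmma::fill_fragment
    wmma::load_matrix_sync(q_frag, Q, ldq);          // rocwmma::load_matrix_sync
    wmma::load_matrix_sync(k_frag, K, ldk);
    wmma::mma_sync(s_frag, q_frag, k_frag, s_frag);  // rocwmma::mma_sync
    wmma::store_matrix_sync(S, s_frag, lds, wmma::mem_row_major);
}
```

One difference worth checking during the port is the wavefront/warp width (RDNA3 runs rocWMMA in wave32 mode, matching NVIDIA's 32-thread warps, but any wave64 assumptions in the launch configuration would need adjusting).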

![Image](https://github.com/user-attachments/assets/dbeb1b16-af7f-4075-a75e-0fe2b403da7a)

The 7900 XTX can run it on Windows by using ZLUDA. The 7B model uses about 16 GB of VRAM ([parallel_size=1](https://github.com/deepseek-ai/Janus/blob/main/generation_inference.py#L60), generating 1 image at a time). Add this code after `import torch` ([line 20](https://github.com/deepseek-ai/Janus/blob/main/generation_inference.py#L20))...
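The snippet itself is elided above. For context only, a workaround commonly applied by ZLUDA users (an assumption on my part, not necessarily the exact code meant here) is to disable the cuDNN and fused-attention backends, which ZLUDA cannot translate, and force the plain math scaled-dot-product path:

```python
import torch

# Assumed ZLUDA workaround, not confirmed as the elided snippet:
# ZLUDA lacks cuDNN and the fused attention kernels, so disable them
# and fall back to the unfused math SDP backend.
torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```

These flags only select among PyTorch's built-in attention backends; they do not change model outputs beyond normal numerical differences.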