侯奇
> > [@qinghon](https://github.com/qinghon) Flux does support SM89, but not in the currently released version yet.
> > hi [@wenlei-bao](https://github.com/wenlei-bao), could you please provide an estimated timeline for PCIe support in [#32](https://github.com/bytedance/flux/issues/32#issuecomment-2300527153)...
flux #0: total 250.066 us, gemm 384.129 us, comm -134.063 us, gemm_only 191.983 us
* total is measured with AG+GEMM
* gemm is measured with a separate GEMM-only implementation...
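How the reported numbers relate, read off the log line itself rather than from Flux source: `comm` appears to be derived as `total - gemm`, so a negative value means the overlapped AG+GEMM run finished faster than the standalone GEMM measurement.

```python
# Interpretation of the log line above (derived from its numbers, not Flux source)
total, gemm = 250.066, 384.129   # microseconds, from the log line
comm = total - gemm              # -> -134.063 us, matching the log
print(f"comm {comm:.3f} us")     # negative comm: overlap fully hides communication
```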
1. For PCIe machines, it is better to use ring mode; all-to-all is for NVLink (see the sketch after this list).
2. Nope.
3. local_copy is not disabled?
4. There should not be any difference. If you find a...
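A hypothetical helper, not part of Flux, illustrating the rule of thumb in item 1: ring mode on PCIe-only machines, all-to-all when NVLink is present. The function name `pick_ag_mode` and the returned mode strings are made up for illustration; the NVLink probe uses pynvml.

```python
import pynvml

def pick_ag_mode(device_index: int = 0) -> str:
    """Hypothetical: return "all-to-all" if NVLink is active, else "ring"."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) == \
                        pynvml.NVML_FEATURE_ENABLED:
                    return "all-to-all"   # NVLink detected
            except pynvml.NVMLError:
                break                     # link not supported: PCIe-only box
        return "ring"
    finally:
        pynvml.nvmlShutdown()
```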
Please make sure you get the latest version. Follow the doc and start from scratch. If there is still a problem, please provide more info, such as your nvcc version and...
Is this fixed? Closing due to long inactivity. Feel free to re-open it.
> Does nvshmem support multi-machine p2p? Thanks! [@wenlei-bao](https://github.com/wenlei-bao)

It does support multi-machine, but this looks like a bug here. Please provide your test command.
For dense models with sequence parallelism, AR + LN is converted into RS + LN + AG, where AR stands for AllReduce, LN for LayerNorm, RS for ReduceScatter, and AG for AllGather. For the FFN part,...
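To make the conversion concrete, here is a minimal sketch using plain `torch.distributed` collectives (assumed shapes; not Flux's fused kernels). The equivalence holds because LayerNorm normalizes each token independently, so it commutes with sharding along the sequence dimension.

```python
import torch
import torch.distributed as dist

def ar_ln(partial, ln):
    # baseline: AllReduce the partial GEMM outputs, then LayerNorm
    out = partial.clone()
    dist.all_reduce(out)                        # AR
    return ln(out)

def rs_ln_ag(partial, ln):
    # sequence-parallel form: ReduceScatter -> LayerNorm -> AllGather
    world = dist.get_world_size()
    shard = torch.empty(partial.shape[0] // world, partial.shape[1],
                        device=partial.device, dtype=partial.dtype)
    dist.reduce_scatter_tensor(shard, partial)  # RS along the sequence dim
    shard = ln(shard)                           # LN on the local shard only
    out = torch.empty_like(partial)
    dist.all_gather_into_tensor(out, shard)     # AG restores the full sequence
    return out                                  # matches ar_ln(partial, ln)
```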
> In the end-to-end (E2E) implementation, you have used Tensor Parallelism, correct?

Yes.

> How are you handling Reduce Scatter (RS) after post-projection?

Post-projection is a GEMM too. Usually post-projection...
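A minimal sketch of that pattern with assumed shapes and plain collectives, not Flux's fused GEMM+RS kernel: the post-projection is a row-parallel GEMM whose partial sums are combined with a ReduceScatter when sequence parallelism is used.

```python
import torch
import torch.distributed as dist

def post_projection_rs(attn_out_local, w_proj_shard):
    # attn_out_local: [seq, hidden // tp], local shard of the attention output
    # w_proj_shard:   [hidden // tp, hidden], row shard of the projection weight
    partial = attn_out_local @ w_proj_shard     # GEMM producing partial sums
    world = dist.get_world_size()
    out = torch.empty(partial.shape[0] // world, partial.shape[1],
                      device=partial.device, dtype=partial.dtype)
    dist.reduce_scatter_tensor(out, partial)    # sum across ranks + split seq
    return out                                  # [seq // tp, hidden]
```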
> I have read articles about Flux and noticed that the paper mentions a `TP+SP` approach in Transformer, not pure `TP`. To confirm: During the decoding phase of the inference...
Can you provide more information, such as the compile environment (CUDA version) and hardware info? We do support FP8; I don't know why it fails.