Murali Andoorveedu comments

Results: 50 comments of Murali Andoorveedu

[Distributed] Add P2P versions of *object_list operations

Hey @wconstab gentle ping on this one :)

[Distributed] Add P2P versions of *object_list operations

Hey @albanD @wconstab any update on this one? Would anyone be available to review this?

[Distributed] Add P2P versions of *object_list operations

Is there anything else blocking this PR? From what I can tell, the test failures seem to be unrelated.

[Distributed] Add P2P versions of *object_list operations

Whoops, let me fix that.

[Distributed] Add P2P versions of *object_list operations

Hey @albanD, would you be able to help me fix the tags/reviewer list and kick off a new test run? I messed up the rebase a little bit.

[Core] Pipeline Parallel Support

@youkaichao Yes, for sure. It is one of the TODO items above.

[Core] Pipeline Parallel Support

Updated the RFC here: https://github.com/vllm-project/vllm/issues/4461 @youkaichao Let me know if anything needs further elaboration

[Core] Pipeline Parallel Support

FYI, I'm pretty sure PyTorch has a bug, filed here: https://github.com/pytorch/pytorch/issues/125079 I worked around this last week by making the sending and receiving phase for each model atomic by concatenating residuals and hidden...
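
For readers unfamiliar with the workaround mentioned in the comment above, here is a rough sketch of the general idea only: pack two payloads into one message so that a single send and a single receive carry both, and the pair cannot be split or reordered. This is not the code from the PR. The helper names `pack` and `unpack` are hypothetical, plain Python lists stand in for tensors, and a `queue.Queue` stands in for the P2P link between two pipeline stages.

```python
from queue import Queue


def pack(hidden, residual):
    """Join two equal-length sequences into one message: [n, *hidden, *residual]."""
    if len(hidden) != len(residual):
        raise ValueError("both parts must have the same length")
    return [len(hidden), *hidden, *residual]


def unpack(message):
    """Split a packed message back into its two parts."""
    n = message[0]
    return message[1:1 + n], message[1 + n:1 + 2 * n]


# Stand-in for a P2P link between two pipeline stages (an assumption for
# illustration; the real code would use a distributed send/recv).
channel = Queue()

# Sender: one put() instead of two, so both parts travel together.
channel.put(pack([0.1, 0.2, 0.3], [1.0, 2.0, 3.0]))

# Receiver: one get() yields both parts.
hidden, residual = unpack(channel.get())
```

The trade-off in a scheme like this is an extra copy on each side in exchange for a single communication call per stage.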

[Core] Pipeline Parallel Support

Sounds good @youkaichao, I can update mine once that's merged. Will you also include the change to create the multiple CPU TP groups or should I create a separate PR?

[Core] Pipeline Parallel Support

Sounds good - I'll revert the PyNCCL changes on this PR and wait for that to be merged to add in