Cade Daniel

Results: 121 comments by Cade Daniel

There's an inefficient allocation during spec decode which can cause OOM when paired with a large batch size. I lowered the batch size in that test; it passes locally for...

@richardliaw seems it’s waiting for compute. I found this out by opening the buildkite link.

idea seems good to me. the block manager v2 will soon support the notion of a null block. we can extend it to allocate such a null block even when...

What does EFA support currently look like? cc @lw

> What does EFA support currently look like? cc @lw

I see this issue: https://github.com/pytorch/pytorch/issues/65022. It looks like, because EFA doesn't implement all of the InfiniBand features, TensorPipe fails on EFA....

Alright, thanks @lw! I may be interested in contributing a backend which works on EFA, will reach out if so.

I'm looking into this in my spare time. I've reproduced it and, interestingly, the issue only repros when the task takes an argument _and_ returns data. Furthermore, the amount of...

> This breaks some tests
>
> FYI @jiwq this is reverted