XiongfeiWei
Somehow, I still can't see my TPU CI running. (Is it because all the tests run in sequence, and a CI job before the TPU CI gets stuck and blocks...
The failing CI appears to be timing out. I don't see why my PR would cause that.
Thanks @mgoin. I also ran some checks on my A100 VM. For the 2 failing tests: - `VLLM_USE_V1=1 pytest -s -vv tests/mq_llm_engine/test_error_handling.py::test_mp_crash_detection`: it fails on the main branch (4c33d6732148fdaeb9780fa86fca1f87f2a93c19)...
The problematic line is `o_ref[:, q_head_idx, :] = acc_scratch_ref[:].astype(o_ref.dtype)`. I found a way to work around the problem (the code is in https://github.com/jax-ml/jax/issues/24415). But I'm trying to figure out why...
It seems the assignee is not set when I create the issue using the link https://github.com/google/jax/issues/new?assignees=apaszke from the error message, so I'm manually cc'ing @apaszke.
Thanks Justin for the explanation!
> `NUM_KV_PAGES_PER_BLOCK` is no longer used in tpu_model_runner.py after this change. Is that intentional?

Yeah, `NUM_KV_PAGES_PER_BLOCK` is used for padding. Since we don't need to pad anymore, we no longer...
Thanks @mgoin for the review!
> The CI error is a bit tricky to solve.
>
> **Problem:** I'm using some CUDA functions defined inside PyTorch, which requires linking _libc10_cuda.so_ to the test binaries. However,...
For "Problem1: C++ test binaries need all references to be resolved", you mentioned "Solution: Create a fallback implementation of the CUDA functions". Could you point to...