Cody Yu

22 results (issues and pull requests) by Cody Yu

This PR adds a 1-D OPT model implementation. See `test_1d.py` for model inputs and outputs. Some highlights: 1. The model uses the fused kernel from `ft_mha`. The corresponding repo should...

Initial support for multiple executables to cover various input prompt lengths. Usage (note that this is only applicable to autoregressive mode): ``` python benchmark_text_gen.py --model alpa/opt-2.7b --path ... --multi-executable...

## 🐛 Bug ## To Reproduce Steps to reproduce the behavior: 1. Install torch_xla https://github.com/pytorch/xla/commit/d8db50a778a39fab0a58436307a3225a6ca06f67. 2. Install HuggingFace transformers https://github.com/huggingface/transformers/commit/06a6a4bd516f7d0ba7c4966a2d3d9c0bf07797ae 3. Run the following: ```python >>> from transformers import TrainingArguments...

bug

## 🐛 Bug ## To Reproduce Steps to reproduce the behavior: Note that I intentionally didn't set XLA configuration. ```python >>> import torch_xla >>> torch_xla._XLAC._xla_get_devices() Traceback (most recent call last):...

bug

Hi, I'm a new user of this package. I'm trying to retrieve information about local variables, as shown in the sample on the website http://stephane.godbillon.com/BytecodeParser/. The problem I...

Hi there, I tried to benchmark the performance of `nn.Linear` in AI Template on an MI250 GPU and compared it with rocBLAS. I expected AI Template to achieve much higher throughput,...
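For reference, achieved throughput in such a comparison is usually derived from the GEMM FLOP count of the linear layer. A minimal sketch (the shape and timing below are hypothetical, not from the actual benchmark):

```python
def linear_tflops(batch, in_features, out_features, seconds):
    """Achieved TFLOP/s for an nn.Linear-style GEMM.

    A (batch x in_features) input times a (in_features x out_features)
    weight costs 2 * batch * in_features * out_features FLOPs
    (one multiply and one add per accumulated element).
    """
    flops = 2 * batch * in_features * out_features
    return flops / seconds / 1e12

# Hypothetical shape and wall-clock time, for illustration only.
tflops = linear_tflops(4096, 8192, 8192, 0.01)
```

Comparing this number against the device's peak TFLOP/s shows how far each backend is from the hardware roofline.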

When benchmarking my serving workloads, I found that the following pattern consistently causes an OOM error: 1. Launch a container with SRT and flashinfer enabled. 2. Benchmark with 800 requests. 3....

I was trying constrained decoding with Qwen but got a crash at this line: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/router/infer_batch.py#L60 This is because the output type of Qwen's `tokenizer.convert_ids_to_tokens(...)` is not `str` but `bytes`, so the...
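The `str`-vs-`bytes` mismatch can be handled by normalizing tokens before any string operation. A minimal sketch (the `normalize_token` helper and the sample token list are illustrative, not sglang's actual fix):

```python
def normalize_token(token):
    """Decode byte-level tokens to str.

    Some tokenizers (e.g. Qwen's) return bytes from
    convert_ids_to_tokens instead of str, which crashes code
    that assumes str everywhere.
    """
    if isinstance(token, bytes):
        return token.decode("utf-8", errors="replace")
    return token

# Hypothetical mixed output, as such a tokenizer might produce it.
tokens = [b"Hello", "world", b"\xe4\xbd\xa0"]
normalized = [normalize_token(t) for t in tokens]
```

Using `errors="replace"` avoids a secondary crash when a byte token is not valid UTF-8 on its own (byte-level BPE tokens often split multi-byte characters).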

The current `execute_model()` interface accepts the following arguments: ``` seq_group_metadata_list: List[SequenceGroupMetadata], blocks_to_swap_in: Dict[int, int], blocks_to_swap_out: Dict[int, int], blocks_to_copy: Dict[int, List[int]], num_lookahead_slots: int, ``` Since this interface is used by many...
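A common cleanup for a wide interface like this is to bundle the arguments into a single request object, so new fields can be added without touching every caller. A minimal sketch; the class name and defaults here are illustrative, not vLLM's actual API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ExecuteModelRequest:
    """Illustrative container bundling the execute_model() arguments
    listed above; field names mirror the current interface."""
    seq_group_metadata_list: List[object]
    blocks_to_swap_in: Dict[int, int] = field(default_factory=dict)
    blocks_to_swap_out: Dict[int, int] = field(default_factory=dict)
    blocks_to_copy: Dict[int, List[int]] = field(default_factory=dict)
    num_lookahead_slots: int = 0

# Callers construct one object instead of passing five positionals.
req = ExecuteModelRequest(seq_group_metadata_list=[])
```

With this shape, `execute_model(req)` keeps a stable signature even as scheduling metadata grows.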

### Motivation. **Support float8_e4m3 for NVIDIA GPUs:** The current FP8 kv-cache supports e5m2 on NVIDIA GPUs, and e4m3 on AMD GPUs. While e5m2 seems to be an ideal format for...

RFC
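For context on the two FP8 formats above: e5m2 trades mantissa bits for range, while e4m3 keeps more precision over a narrower range. A small sketch computing the largest finite value of each, assuming the common OCP convention for e4m3 (top exponent code still encodes normal numbers, with only the all-ones mantissa reserved for NaN):

```python
def max_normal(exp_bits, man_bits, finite_only=False):
    """Largest finite value of a small float format.

    finite_only=True models the OCP e4m3 convention, where the top
    exponent code still holds normal numbers (only mantissa=all-ones
    is NaN); False models the IEEE-style e5m2 convention, where the
    top exponent code is reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if finite_only:
        max_exp = (2 ** exp_bits - 1) - bias
        # The all-ones mantissa at the top exponent is NaN, so the
        # largest fraction is one ulp below all-ones.
        frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    else:
        max_exp = (2 ** exp_bits - 2) - bias
        frac = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    return frac * 2 ** max_exp

e4m3_max = max_normal(4, 3, finite_only=True)  # 448.0
e5m2_max = max_normal(5, 2)                    # 57344.0
```

The two orders of magnitude of extra range are why e5m2 suits values with large dynamic range, while e4m3's extra mantissa bit gives finer resolution where the range suffices, such as kv-cache activations.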