Thien Tran
Quant-LLM code: https://github.com/pytorch/ao/tree/main/torchao/csrc/cuda/fp6_llm Currently the Quant-LLM kernel (backing FPx in torchao) only works with FP16. This creates a small divergence from other quantization methods, which all work with BF16. Since all...
In `optim.load_state_dict(state_dict)`, if the optimizer dtype != the state_dict dtype, `aten._to_copy.default` is called. This PR simply implements this op and adds appropriate tests. **Update**: In PyTorch pre-2.4, calling `.to(device, dtype)` will not...
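To illustrate the mechanism (not the torchao implementation): a wrapper tensor subclass sees `aten._to_copy.default` in `__torch_dispatch__` whenever a dtype-changing `.to(...)` is called on it, which is what happens during `load_state_dict` with mismatched dtypes. `WrappedTensor` below is a hypothetical minimal sketch.

```python
import torch

aten = torch.ops.aten

class WrappedTensor(torch.Tensor):
    """Hypothetical minimal wrapper subclass; only handles aten._to_copy.default."""

    @staticmethod
    def __new__(cls, data):
        return torch.Tensor._make_wrapper_subclass(cls, data.shape, dtype=data.dtype)

    def __init__(self, data):
        self._data = data

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is aten._to_copy.default:
            # Convert the inner tensor and re-wrap; this is the op that
            # load_state_dict triggers when optim dtype != state_dict dtype.
            return WrappedTensor(args[0]._data.to(dtype=kwargs.get("dtype")))
        raise NotImplementedError(f"{func} is not handled by this sketch")

x = WrappedTensor(torch.zeros(4))          # FP32 inside
y = x.to(torch.bfloat16)                   # routed through aten._to_copy.default
```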
https://github.com/pytorch/ao/tree/main/torchao/prototype/quantized_training Currently the INT8 training recipes only support **row-wise scaling** for the weight. This should be strictly better than (or at least the same as) **tensor-wise scaling** for the weight in terms of...
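A minimal sketch of why row-wise scaling should dominate tensor-wise scaling for accuracy (NumPy for illustration, not the torchao kernels): with one scale per row, an outlier row does not inflate the quantization error of every other row.

```python
import numpy as np

def quantize_int8_rowwise(w):
    # One scale per row: scale[i] = max(|w[i, :]|) / 127
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def quantize_int8_tensorwise(w):
    # A single scale shared by the whole tensor
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
w[0] *= 100.0  # one outlier row blows up the tensor-wise scale

q_row, s_row = quantize_int8_rowwise(w)
q_tsr, s_tsr = quantize_int8_tensorwise(w)
err_row = np.abs(q_row * s_row - w).mean()
err_tsr = np.abs(q_tsr * s_tsr - w).mean()
```

Here `err_row` stays small while `err_tsr` is dominated by the non-outlier rows collapsing toward zero, matching the claim that row-wise scaling is at worst equal to tensor-wise scaling.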
#### Context

What is the purpose of this PR? Is it to

- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or...
**Steps/Code to reproduce bug**

```python
import torch
import cutlass.epilogue

def epilogue(accum, bias):
    D = accum + bias
    return D

examples_tensors = dict(
    accum=torch.randn(1024, 1024),
    bias=torch.randn(1024, 1).bfloat16(),
    D=torch.randn(1024, 1024).bfloat16(),
)

cutlass.epilogue.trace(epilogue, ...
```
Fixes #1824. I was thinking of adding a test case for this, but currently the dtype is hard-coded to FP16: https://github.com/NVIDIA/cutlass/blob/44dae8b90ef232ea663727470dfbbe9daff6972d/test/python/cutlass/evt/utils/evt_testbed.py#L206 It would take some refactoring to test multiple dtypes at...
### Feature request

Add BetterTransformer support for SEW. SEW has an almost identical architecture to Wav2Vec2. In particular, the attention modules are the same.

### Motivation

NA

### Your contribution

I'm...
## Pull Request Description

When `stream=true`, the OpenAI API does not require `stream_options` to be specified. This will work:

```
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY"...
```
## Describe Your Changes

- Note: the remote engine refactor should be merged before this PR

TODO:
- [ ] Use Ollama API (`/api/chat`) / Ollama client to set context length...
## Describe Your Changes

Replace cortex's `/v1/hardware` with Rust.

- [x] Basic hardware info: CPU, OS, RAM usage (Power and Storage are removed since they are not implemented in Cortex,...