Stas Bekman


Perhaps going forward it'd be easiest to report TFLOPS not as a single number, but as something like `415(bf16) 490(fp8)` - then fp4, mxfp4, etc. can be added as well...
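For illustration, a minimal sketch of what such a per-dtype breakdown could look like (the numbers and the `results` dict are made up, not an actual benchmark API):

```python
# Hypothetical per-dtype throughput numbers, just to show the format:
results = {"bf16": 415, "fp8": 490}  # measured TFLOPS per dtype

# Render as "415(bf16) 490(fp8)"; new dtypes (fp4, mxfp4, ...) just extend the dict
print(" ".join(f"{tflops}({dtype})" for dtype, tflops in results.items()))
```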

I personally don't use MFU as it's a BS number to begin with, since [100% is unachievable](https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-flops) and, moreover, the achievable efficiency varies wildly between GPUs. So if you move from B200...
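A sketch of why the denominator matters (all numbers here are assumptions for illustration: the advertised peak is the H100 bf16 spec, and the "maximum achievable" figure stands in for the best real-world matmul throughput one could actually measure on that GPU):

```python
achieved_tflops = 415.0        # measured training throughput (assumed)
peak_tflops = 989.0            # vendor-advertised H100 bf16 peak
max_achievable_tflops = 780.0  # best measurable matmul throughput (assumed)

mfu = achieved_tflops / peak_tflops                   # denominator is unreachable in practice
normalized = achieved_tflops / max_achievable_tflops  # against what is actually attainable
print(f"MFU={mfu:.1%} vs achievable-normalized={normalized:.1%}")
```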

yup, but as I suggested above it's probably best not to average, since an important signal will be lost if one of the dtypes has a worse implementation than...

Yes, that's why I'm suggesting a breakdown report. Users should care a lot about reported TFLOPS and try to improve those. If they don't, it will cost them $$ and...

tokens/s is also a very vague metric, useful only for local relative comparisons, and even then one has to be very careful - this number alone is meaningless. Due to...
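One way to see why tokens/s alone is meaningless: converting it into model FLOPS requires knowing the model size. A back-of-envelope sketch with made-up numbers, using the common ~6 FLOPs per parameter per token approximation for training (which ignores the attention term):

```python
n_params = 7e9           # model size, e.g. 7B parameters (assumed)
tokens_per_sec = 10_000  # measured throughput (assumed)

# ~6*N FLOPs per token for training (fwd + bwd), attention term ignored
tflops = 6 * n_params * tokens_per_sec / 1e12
print(f"~{tflops:.0f} TFLOPS")  # the same tokens/s on a 70B model would be 10x the FLOPS
```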

or even:
```
python -c "from torchtune.modules import TiedLinear"
```
which is what's really needed to use `torchtune.training._activation_offloading.OffloadActivations`. I'm on pt-2.4 at the moment. At the very least torchtune could...

Understood! Would it be too difficult to do a runtime check for the pytorch version in `__init__.py` and tell the user if there is a mismatch and what exact version is required?...
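A minimal sketch of what such a check in `torchtune/__init__.py` could look like (the minimum version string here is hypothetical, not torchtune's actual requirement):

```python
import torch
from packaging.version import parse

_MIN_TORCH = "2.5.0"  # hypothetical minimum version, for illustration only
if parse(torch.__version__) < parse(_MIN_TORCH):
    raise ImportError(
        f"torchtune requires torch>={_MIN_TORCH}, but torch=={torch.__version__} "
        "is installed. Please upgrade pytorch."
    )
```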

The code that fails because the version is wrong will not have a warning next to it explaining to the user what is wrong. Therefore an assert would be by far...

Oh, thank you very much for explaining, Joe. I had initially understood that the latest version of pytorch was required. Normally, in all other frameworks I've worked on, we tested at...

> I will add that there is a per-request `add_special_tokens` parameter that can be used with both (3) and (4) which will control whether the BOS token is added

I...
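(The "(3)" and "(4)" refer to options discussed earlier in the thread.) As an analogy for what this flag controls, here is the same-named parameter in the HuggingFace `transformers` tokenizer API; the model name is just an example:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tok("hello", add_special_tokens=True).input_ids)   # BOS (id 1) prepended
print(tok("hello", add_special_tokens=False).input_ids)  # raw tokens only, no BOS
```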