Vitaliy Chiley

8 issues by Vitaliy Chiley

Torchvision's [RandomResizedCrop](https://pytorch.org/vision/stable/generated/torchvision.transforms.RandomResizedCrop.html#randomresizedcrop) and [Resize](https://pytorch.org/vision/stable/generated/torchvision.transforms.Resize.html#resize) allow the user to specify the interpolation. This isn't an option in [ffcv's RandomResizedCrop](https://github.com/libffcv/ffcv/blob/main/ffcv/transforms/random_resized_crop.py#L24), nor is it documented which interpolation is being used. Could this...
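
For reference, a minimal sketch of the torchvision side, where the resampling filter is an explicit argument (the specific `InterpolationMode` choices below are just examples, not a recommendation):

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# RandomResizedCrop and Resize both accept an `interpolation` argument;
# BILINEAR is torchvision's default.
train_tf = transforms.RandomResizedCrop(224, interpolation=InterpolationMode.BICUBIC)
eval_tf = transforms.Resize(256, interpolation=InterpolationMode.BILINEAR)
```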

Uses https://github.com/mosaicml/llm-foundry/pull/147 as a springboard to update torch. In an interactive instance, I installed the torch 2 requirements and everything works fine; the 125M model was getting good (the same) MFU from the same...

I fork triton and rename it to `triton_pre_mlir`; triton diff [here](https://github.com/openai/triton/compare/main...vchiley:triton:triton_pre_mlir). llmfoundry/models/layers/flash_attn_triton.py is copy-pasted from [HazyResearch flash_attn_triton](https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py), where I modify the imports to be ``` import triton_pre_mlir as triton import...
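
For clarity, a minimal sketch of that import swap (the `.language as tl` alias is the usual triton convention and is assumed here, not quoted from the file):

```python
# Original (HazyResearch flash_attn_triton.py):
#   import triton
#   import triton.language as tl
#
# Swapped to the renamed pre-MLIR fork so it can coexist with the stock
# `triton` package that the torch 2 requirements pull in:
import triton_pre_mlir as triton
import triton_pre_mlir.language as tl
```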

Ran `composer train/train.py train/yamls/pretrain/mpt-3b.yaml`, also with `model.fc_type=te` and `precision=amp_fp8`.

Result:
```
torch:       throughput/device/tokens_per_sec: 23.7k
te:          throughput/device/tokens_per_sec: 23.7k
te with fp8: throughput/device/tokens_per_sec: 29.4k
```

Note there does seem to be this...
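
For context, the two extra settings are passed as command-line overrides on top of the YAML; assuming llm-foundry's usual `key=value` override syntax, the full invocation would look something like `composer train/train.py train/yamls/pretrain/mpt-3b.yaml model.fc_type=te precision=amp_fp8` (the exact form of the command is an assumption, not quoted from the issue).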

- ~~Clean up \_\_init__ / param init for FusedExpertsNetwork~~ done in another PR
- ~~enable FusedExpertsNetwork to run without bias~~ done in another PR
- make _num_global_experts not a buffer...

[This example](https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_amp.py) shows how to use Tutel MoE with torch autocast AMP. Q: Is the All2All still meant to be done in FP32? In general, torch autocast AMP keeps network...
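
Not Tutel's actual implementation, but a minimal sketch of the pattern the question is about: under autocast, matmuls run in half precision, and a specific op can be pinned to FP32 by disabling autocast and upcasting around it (the `all_to_all_fp32` helper and the use of the generic `torch.distributed` collective are illustrative assumptions):

```python
import torch
import torch.distributed as dist

def all_to_all_fp32(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: run the All2All in FP32 even inside an autocast region."""
    with torch.autocast(device_type="cuda", enabled=False):
        x_fp32 = x.float()                   # upcast before the collective
        out = torch.empty_like(x_fp32)
        dist.all_to_all_single(out, x_fp32)  # the collective itself runs in FP32
    return out

# Typical usage: matmuls inside the autocast region run in half precision,
# while the dispatch/combine collectives stay in FP32 via the helper above.
# with torch.autocast(device_type="cuda", dtype=torch.float16):
#     dispatched = all_to_all_fp32(expert_inputs)
```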

My interpretation of [get_custom_L2](https://github.com/DingXiaoH/RepVGG/blob/main/repvgg.py#L66) is that L2 decay is applied not to the individual branch weights being trained, but instead to the deploy-equivalent (reparameterized) weights. If this is the motivation, wouldn't...
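
A rough sketch of that interpretation, under the simplifying assumption that the BN scaling the real get_custom_L2 folds in is ignored: the 1x1 branch is zero-padded to 3x3, summed with the 3x3 branch to form the deploy-equivalent kernel, and the L2 penalty is taken on that merged kernel.

```python
import torch
import torch.nn.functional as F

def deploy_equivalent_l2(k3x3: torch.Tensor, k1x1: torch.Tensor) -> torch.Tensor:
    """Illustrative only: L2 penalty on the merged (deploy-time) kernel.

    k3x3: [out_c, in_c, 3, 3] weight of the 3x3 branch
    k1x1: [out_c, in_c, 1, 1] weight of the 1x1 branch
    The actual get_custom_L2 also accounts for each branch's BN scale/std.
    """
    # Zero-pad the 1x1 kernel to 3x3 so the branches can be summed into the
    # single kernel that structural reparameterization produces at deploy time.
    k1x1_padded = F.pad(k1x1, [1, 1, 1, 1])
    k_deploy = k3x3 + k1x1_padded
    return (k_deploy ** 2).sum()
```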

[Here](https://github.com/tomaarsen/attention_sinks#fluency-during-subsequent-prompting-for-chat-style-llms):
> For MPT-7B-chat, a RuntimeError is encountered for transformers when the input length exceeds 2048.

Can you comment on what the RuntimeError was? You have run [mpt7b with seq...