Driss Guessous

Results 183 comments of Driss Guessous

So I'm getting errors on this diff stack: https://github.com/pytorch/pytorch/pull/142281. I tried rebasing, reinstalling lintrunner, and cleaning from source, but I still get hundreds of errors for this rule. Noob question, I have...

Actually I think it has to do with how lintrunner only checks modifications? If I run `lintrunner torch/_inductor/wrapper_benchmark.py` on main, I see the same errors.

I don't think there is any current example; I just generated one by calling flex_attention with `BACKEND = "FLASH"`:

```py
# kernel path: /tmp/torchinductor_dev/y4/cy4su5k46zgriqpkqosly2lalii733le3qyfgzilhvjzbncfvvyc.py
# Topologically Sorted Source Nodes: [flex_attention],...
```
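For context on what that generated kernel computes: flex_attention lets you pass a `score_mod` callable that rewrites each attention score before the softmax. A dependency-free reference sketch of that contract, at toy sizes (`flex_attention_ref` and `causal` are illustrative names here, not the real API):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def flex_attention_ref(q, k, v, score_mod):
    # q, k, v: lists of vectors (seq_len x dim).
    # score_mod(score, q_idx, kv_idx) rewrites each raw score before softmax,
    # which is the hook flex_attention generalizes over plain SDPA.
    out = []
    for qi, qv in enumerate(q):
        scores = [sum(a * b for a, b in zip(qv, kv)) / math.sqrt(len(qv))
                  for kv in k]
        scores = [score_mod(s, qi, ki) for ki, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(wj * v[j][d] for j, wj in enumerate(w))
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    # Mask future positions, as in a causal decoder.
    return score if kv_idx <= q_idx else float("-inf")
```

With `causal`, query position 0 can only attend to key 0, so its output row equals `v[0]` exactly.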

1. This is probably kind of true today, but I'm not sure it always will be.
2. For the most part we do; I think if people plan to use...

Ohh sorry, I totally meant that as a question; tbh I don't totally understand the value of safetensors vs `weights_only=True`. My understanding is that you don't want pickle...
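To make the pickle concern concrete: unrestricted unpickling deserializes arbitrary Python, and the `__reduce__` hook lets a crafted payload call any function at load time, which is the attack both safetensors and `weights_only=True` are designed to rule out. A minimal stdlib sketch (the `Evil` class is illustrative):

```python
import pickle

class Evil:
    def __reduce__(self):
        # On load, pickle will call list("pwned") -- any callable could
        # go here instead, e.g. os.system, which is the core risk of
        # loading untrusted checkpoints with a full unpickler.
        return (list, ("pwned",))

payload = pickle.dumps(Evil())
obj = pickle.loads(payload)  # runs the callable; not an Evil instance at all
```

A restricted loader (or a pure-data format like safetensors) refuses to invoke arbitrary callables, so payloads like this fail instead of executing.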

We should also take a look at the new blockwise fp8 GEMM added in CUTLASS 3.7 cc @alexsamardzic

What does the output code look like on Raspberry Pi? Just the normal cpp codegen?

# FP8 vs MXFP8 Benchmark Comparison

## References

- **MXFP8**: https://fburl.com/s2g726a1

  `CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.print_after_conversion --training.compile --training.steps 150 --model.converters mx --mx.recipe_name "mxfp8" --profiling.enable_profiling`

  `step: 70 loss: 6.9682 memory: 35.94GiB(20.15%) tps:...`
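For readers comparing the two recipes above: MXFP8 differs from plain per-tensor FP8 mainly in that small blocks of elements share a power-of-two scale. A toy sketch of that blockwise scheme (the block size and mantissa width here are illustrative, not the MX spec's actual parameters):

```python
import math

def mx_quantize(vals, block=4, mantissa_bits=3):
    # Toy MX-style blockwise quantization: each block of `block` elements
    # shares one power-of-two scale (analogous to an E8M0 scale), and each
    # element is rounded onto a low-precision grid under that scale.
    out = []
    for i in range(0, len(vals), block):
        chunk = vals[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = 2.0 ** math.floor(math.log2(amax))
        grid = 2 ** mantissa_bits
        out.extend(round(v / scale * grid) / grid * scale for v in chunk)
    return out
```

Values that already sit on the block's grid round-trip exactly; everything else snaps to the nearest representable point, so the per-block scale tracks local dynamic range instead of one scale for the whole tensor.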

AFAIK there are passes in Inductor that can do prologue fusion and actually get speedups. For decode sizes with small activations and large weights this can be faster if...
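The idea behind prologue fusion, sketched in plain Python: apply the elementwise prologue (a hypothetical int8 dequant here) as each weight element is loaded inside the GEMM loop, instead of materializing the full dequantized weight first. Both versions compute the same result; the fused one skips the extra full-size write/read, which is where the win comes from for small-activation, large-weight decode shapes:

```python
def matmul_unfused(a, b_int8, scale):
    # Unfused: materialize the whole dequantized weight first,
    # paying a full extra pass of memory traffic over b.
    b = [[x * scale for x in row] for row in b_int8]
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matmul_prologue_fused(a, b_int8, scale):
    # Fused: the dequant "prologue" runs as each weight element is
    # consumed by the GEMM, so the fp weight is never written out.
    return [[sum(a[i][k] * (b_int8[k][j] * scale) for k in range(len(b_int8)))
             for j in range(len(b_int8[0]))] for i in range(len(a))]
```

A real kernel does this per tile in fast memory rather than per scalar, but the arithmetic equivalence is the same.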