Driss Guessous

Results 183 comments of Driss Guessous

So I'm getting errors on this diff stack: https://github.com/pytorch/pytorch/pull/142281. I tried rebasing, reinstalling lintrunner, and cleaning from source, but I still get hundreds of errors for this rule. Noob question, I have...

Actually I think it has to do with how lintrunner only checks modifications? If I run `lintrunner torch/_inductor/wrapper_benchmark.py` on main, I see the same errors.

I don't think there is any current example; I just generated one by calling flex_attention with `BACKEND = "FLASH"`:

```py
# kernel path: /tmp/torchinductor_dev/y4/cy4su5k46zgriqpkqosly2lalii733le3qyfgzilhvjzbncfvvyc.py
# Topologically Sorted Source Nodes: [flex_attention],...
```
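For context on what that generated kernel computes: flex_attention lets you pass a `score_mod` callable that rewrites each attention score before the softmax. A dependency-free reference sketch of that contract, at toy sizes (`flex_attention_ref` and `causal` are illustrative names here, not the real API):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def flex_attention_ref(q, k, v, score_mod):
    # q, k, v: lists of vectors (seq_len x dim).
    # score_mod(score, q_idx, kv_idx) rewrites each raw score before softmax,
    # which is the hook flex_attention generalizes over plain SDPA.
    out = []
    for qi, qv in enumerate(q):
        scores = [sum(a * b for a, b in zip(qv, kv)) / math.sqrt(len(qv))
                  for kv in k]
        scores = [score_mod(s, qi, ki) for ki, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(wj * v[j][d] for j, wj in enumerate(w))
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    # Mask future positions, as in a causal decoder.
    return score if kv_idx <= q_idx else float("-inf")
```

With `causal`, query position 0 can only attend to key 0, so its output row equals `v[0]` exactly.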

1. This is probably kind of true today, but I'm not sure it always will be.
2. For the most part we do; I think if people plan to use...

Ohh sorry, I totally meant that as a question; tbh I don't totally understand the value of safetensors vs `weights_only=True`. My understanding is that you don't want pickle...
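To make the pickle concern concrete: unrestricted unpickling deserializes arbitrary Python, and the `__reduce__` hook lets a crafted payload call any function at load time, which is the attack both safetensors and `weights_only=True` are designed to rule out. A minimal stdlib sketch (the `Evil` class is illustrative):

```python
import pickle

class Evil:
    def __reduce__(self):
        # On load, pickle will call list("pwned") -- any callable could
        # go here instead, e.g. os.system, which is the core risk of
        # loading untrusted checkpoints with a full unpickler.
        return (list, ("pwned",))

payload = pickle.dumps(Evil())
obj = pickle.loads(payload)  # runs the callable; not an Evil instance at all
```

A restricted loader (or a pure-data format like safetensors) refuses to invoke arbitrary callables, so payloads like this fail instead of executing.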

We should also take a look at the new blockwise fp8 GEMM added in CUTLASS 3.7 cc @alexsamardzic

What does the output code look like on Raspberry Pi? Just the normal cpp codegen?

# FP8 vs MXFP8 Benchmark Comparison

## References

- **MXFP8**: https://fburl.com/s2g726a1

  `CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.print_after_conversion --training.compile --training.steps 150 --model.converters mx --mx.recipe_name "mxfp8" --profiling.enable_profiling`

  `step: 70 loss: 6.9682 memory: 35.94GiB(20.15%) tps:...`
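For readers comparing the two recipes above: MXFP8 differs from plain per-tensor FP8 mainly in that small blocks of elements share a power-of-two scale. A toy sketch of that blockwise scheme (the block size and mantissa width here are illustrative, not the MX spec's actual parameters):

```python
import math

def mx_quantize(vals, block=4, mantissa_bits=3):
    # Toy MX-style blockwise quantization: each block of `block` elements
    # shares one power-of-two scale (analogous to an E8M0 scale), and each
    # element is rounded onto a low-precision grid under that scale.
    out = []
    for i in range(0, len(vals), block):
        chunk = vals[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = 2.0 ** math.floor(math.log2(amax))
        grid = 2 ** mantissa_bits
        out.extend(round(v / scale * grid) / grid * scale for v in chunk)
    return out
```

Values that already sit on the block's grid round-trip exactly; everything else snaps to the nearest representable point, so the per-block scale tracks local dynamic range instead of one scale for the whole tensor.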

AFAIK there are passes in Inductor that can do prologue fusion and actually get speedups. For decode sizes with small activations and large weights this can be faster if...
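The idea behind prologue fusion, sketched in plain Python: apply the elementwise prologue (a hypothetical int8 dequant here) as each weight element is loaded inside the GEMM loop, instead of materializing the full dequantized weight first. Both versions compute the same result; the fused one skips the extra full-size write/read, which is where the win comes from for small-activation, large-weight decode shapes:

```python
def matmul_unfused(a, b_int8, scale):
    # Unfused: materialize the whole dequantized weight first,
    # paying a full extra pass of memory traffic over b.
    b = [[x * scale for x in row] for row in b_int8]
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matmul_prologue_fused(a, b_int8, scale):
    # Fused: the dequant "prologue" runs as each weight element is
    # consumed by the GEMM, so the fp weight is never written out.
    return [[sum(a[i][k] * (b_int8[k][j] * scale) for k in range(len(b_int8)))
             for j in range(len(b_int8[0]))] for i in range(len(a))]
```

A real kernel does this per tile in fast memory rather than per scalar, but the arithmetic equivalence is the same.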