Joosep Pata
ONNX export now works for the GNN-LSH since https://github.com/jpata/particleflow/pull/215. Need to train a model with PyTorch and test that the import in CMSSW gives reasonable results.
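Before the CMSSW side, a quick parity check between the eager PyTorch model and the ONNX export can be done with onnxruntime. A minimal sketch with a stand-in module and made-up shapes (not the actual trained model):

```python
import numpy as np
import onnxruntime
import torch

# stand-in module: any eager-mode model plays the role of the trained MLPF model here
model = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)).eval()
x = torch.randn(1, 256, 17)  # assumption: (batch, elements, input features)

torch.onnx.export(model, (x,), "model.onnx", input_names=["x"], output_names=["y"])

with torch.no_grad():
    ref = model(x).numpy()

sess = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"x": x.numpy()})[0]

# if these agree, the same ONNX file loaded in CMSSW should reproduce the numbers
print(np.abs(ref - out).max())
```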
Here's some material on how to integrate PyTorch models directly via TorchScript, rather than via ONNX: https://indico.cern.ch/event/1388888/contributions/5839133/attachments/2821898/4928058/2024_03_18%20ML%20production%20news.pdf
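For reference, the TorchScript path boils down to tracing (or scripting) the model and saving the archive, which can then be loaded from C++ with `torch::jit::load`. A minimal sketch with a stand-in module:

```python
import torch

# stand-in for the MLPF model; any eager-mode torch.nn.Module works the same way
model = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)).eval()

example = torch.randn(1, 256, 17)  # assumption: (batch, elements, features)

# trace (or use torch.jit.script for data-dependent control flow) and serialize;
# the saved archive is what the C++ side loads with torch::jit::load
scripted = torch.jit.trace(model, example)
scripted.save("mlpf_torchscript.pt")
```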
Couple of notes:
- exporting the PyTorch model to ONNX with dynamic axes does not actually produce a model that can be evaluated with dynamically sized inputs, because MHA uses...
Here's the summary of today: it's possible to export the model (both quantized and unquantized) with dynamic shapes using `torch.onnx.export` in #324. However, `scaled_dot_product_attention` creates the inefficient, fully unrolled attention...
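For concreteness, a minimal sketch of what the dynamic-shape export looks like, using a tiny stand-in module rather than the actual MLPF transformer (names and shapes are placeholders); the exported graph contains the decomposed (unrolled) attention but does accept different batch and sequence sizes:

```python
import numpy as np
import onnxruntime
import torch

class TinyAttention(torch.nn.Module):
    # stand-in for the transformer blocks; calls scaled_dot_product_attention directly
    def forward(self, x):
        return torch.nn.functional.scaled_dot_product_attention(x, x, x)

model = TinyAttention().eval()
dummy = torch.randn(1, 128, 64)

torch.onnx.export(
    model,
    (dummy,),
    "attn_dynamic.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch", 1: "seq"}, "y": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# check that the exported graph really accepts different sequence lengths
sess = onnxruntime.InferenceSession("attn_dynamic.onnx", providers=["CPUExecutionProvider"])
for seq in (64, 256):
    x = np.random.rand(2, seq, 64).astype(np.float32)
    print(seq, sess.run(None, {"x": x})[0].shape)
```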
Here's a potential example of how to write the model by hand using onnxscript: https://github.com/microsoft/onnxruntime/issues/19924#issue-2187484945
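Not the code from that issue, but a minimal onnxscript sketch of the idea: writing the unfused scaled-dot-product attention out by hand as individual ONNX ops.

```python
from onnxscript import FLOAT, script
from onnxscript import opset18 as op

@script()
def unfused_sdpa(q: FLOAT[...], k: FLOAT[...], v: FLOAT[...]) -> FLOAT[...]:
    # softmax(q @ k^T / sqrt(d)) @ v spelled out as individual ONNX ops;
    # the 1/sqrt(d) scale is hardcoded here (d = 64) to keep the sketch short
    scale = op.CastLike(0.125, q)
    att = op.MatMul(q, op.Transpose(k, perm=[0, 1, 3, 2])) * scale
    att = op.Softmax(att, axis=-1)
    return op.MatMul(att, v)

# the decorated function can be turned into a standalone ONNX model for inspection
model_proto = unfused_sdpa.to_model_proto()
```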
Here's what the unfused vs. fused MHA looks like, based on the example above.
With this code

```python
import torch
import time
import onnxruntime
import pathlib
import onnxscript
import onnx
import math
import numpy

dtype_map = {
    numpy.dtype("float32"): onnx.TensorProto.FLOAT,
    numpy.dtype("bool"): onnx.TensorProto.BOOL,
}

class Model(torch.nn.Module):
    ...
```
It looks as if the onnxruntime.transformers optimizer, specifically FusionAttention (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_attention.py#L712), should replace the attention block with MultiHeadAttention, which on GPU should support flash attention.
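A sketch of how that optimizer could be run over the exported model (the file names, head count, and hidden size are placeholders for the actual model hyperparameters):

```python
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.fusion_options import FusionOptions

opts = FusionOptions("bert")
# ask the fusion pass for com.microsoft.MultiHeadAttention rather than the
# monolithic Attention op (flag available in recent onnxruntime versions)
opts.use_multi_head_attention = True

fused = optimizer.optimize_model(
    "model_dynamic.onnx",   # placeholder: the exported model with dynamic shapes
    model_type="bert",      # generic transformer fusion rules
    num_heads=16,           # assumption: the model's number of attention heads
    hidden_size=256,        # assumption: the model's embedding dimension
    optimization_options=opts,
)
fused.save_model_to_file("model_fused.onnx")
```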
If I can replace the SDPA-only part of the graph with the fused attention node (ignore the shapes), then in principle it should be possible to try flash attention on the ONNX model.
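If the automatic fusion doesn't match the exported pattern, the swap could in principle also be done by hand with the onnx helper API. A very rough sketch, where the tensor names and head count are made up and would have to match the actual SDPA subgraph:

```python
import onnx
from onnx import helper

model = onnx.load("model_dynamic.onnx")  # placeholder file name
graph = model.graph

# stand-in node for the hand-rolled SDPA subgraph; a real edit also has to
# remove the MatMul/Softmax/Transpose nodes it replaces and keep names consistent
mha = helper.make_node(
    "MultiHeadAttention",
    inputs=["query", "key", "value"],  # assumption: outputs of the q/k/v projections
    outputs=["attn_out"],              # assumption: input of the output projection
    domain="com.microsoft",
    num_heads=16,                      # assumption: the model's number of heads
)
graph.node.append(mha)

# the contrib op lives in the com.microsoft domain, so declare that opset import
model.opset_import.extend([helper.make_opsetid("com.microsoft", 1)])
onnx.save(model, "model_manual_mha.onnx")
```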
Converting the model with the fused attention layer `com.microsoft.MultiHeadAttention` to float16 does run flash attention on an A100 with the expected speed and memory improvement. The following code has batch size...
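As a rough sketch of the float16 conversion and GPU session setup (not the original benchmark; file names, batch size, and shapes are placeholders):

```python
import numpy as np
import onnx
import onnxruntime
from onnxruntime.transformers.onnx_model import OnnxModel

# convert the fused-attention model to float16 so the MultiHeadAttention kernel
# can dispatch to flash attention on the GPU
m = OnnxModel(onnx.load("model_fused.onnx"))
m.convert_float_to_float16(keep_io_types=False)
m.save_model_to_file("model_fused_fp16.onnx")

sess = onnxruntime.InferenceSession(
    "model_fused_fp16.onnx", providers=["CUDAExecutionProvider"]
)
x = np.random.rand(4, 4096, 64).astype(np.float16)  # placeholder (batch, elements, features)
out = sess.run(None, {sess.get_inputs()[0].name: x})[0]
print(out.shape, out.dtype)
```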