Joosep Pata
ONNX export now works for the GNN-LSH since https://github.com/jpata/particleflow/pull/215. Need to train a model with PyTorch and test that the import in CMSSW gives reasonable results.
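Before the CMSSW side, a quick parity check between the eager PyTorch model and the ONNX export can be done with onnxruntime. A minimal sketch with a stand-in module and made-up shapes (not the actual trained model):

```python
import numpy as np
import onnxruntime
import torch

# stand-in module: any eager-mode model plays the role of the trained MLPF model here
model = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)).eval()
x = torch.randn(1, 256, 17)  # assumption: (batch, elements, input features)

torch.onnx.export(model, (x,), "model.onnx", input_names=["x"], output_names=["y"])

with torch.no_grad():
    ref = model(x).numpy()

sess = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"x": x.numpy()})[0]

# if these agree, the same ONNX file loaded in CMSSW should reproduce the numbers
print(np.abs(ref - out).max())
```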
Here's some material on how to integrate PyTorch models directly via TorchScript, rather than via ONNX: https://indico.cern.ch/event/1388888/contributions/5839133/attachments/2821898/4928058/2024_03_18%20ML%20production%20news.pdf
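For reference, the TorchScript path boils down to tracing (or scripting) the model and saving the archive, which can then be loaded from C++ with `torch::jit::load`. A minimal sketch with a stand-in module:

```python
import torch

# stand-in for the MLPF model; any eager-mode torch.nn.Module works the same way
model = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)).eval()

example = torch.randn(1, 256, 17)  # assumption: (batch, elements, features)

# trace (or use torch.jit.script for data-dependent control flow) and serialize;
# the saved archive is what the C++ side loads with torch::jit::load
scripted = torch.jit.trace(model, example)
scripted.save("mlpf_torchscript.pt")
```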
Couple of notes:
- exporting the PyTorch model to ONNX with dynamic axes does not actually produce a model that can be evaluated with dynamically sized inputs, because MHA uses...
Here's the summary of today: it's possible to export the model (both quantized and unquantized) with dynamic shapes using `torch.onnx.export` in #324. However, `scaled_dot_product_attention` creates the inefficient, fully unrolled attention...
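For concreteness, a minimal sketch of what the dynamic-shape export looks like, using a tiny stand-in module rather than the actual MLPF transformer (names and shapes are placeholders); the exported graph contains the decomposed (unrolled) attention but does accept different batch and sequence sizes:

```python
import numpy as np
import onnxruntime
import torch

class TinyAttention(torch.nn.Module):
    # stand-in for the transformer blocks; calls scaled_dot_product_attention directly
    def forward(self, x):
        return torch.nn.functional.scaled_dot_product_attention(x, x, x)

model = TinyAttention().eval()
dummy = torch.randn(1, 128, 64)

torch.onnx.export(
    model,
    (dummy,),
    "attn_dynamic.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch", 1: "seq"}, "y": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# check that the exported graph really accepts different sequence lengths
sess = onnxruntime.InferenceSession("attn_dynamic.onnx", providers=["CPUExecutionProvider"])
for seq in (64, 256):
    x = np.random.rand(2, seq, 64).astype(np.float32)
    print(seq, sess.run(None, {"x": x})[0].shape)
```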
Here's a potential example of how to write the model by hand using onnxscript: https://github.com/microsoft/onnxruntime/issues/19924#issue-2187484945
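Not the code from that issue, but a minimal onnxscript sketch of the idea: writing the unfused scaled-dot-product attention out by hand as individual ONNX ops.

```python
from onnxscript import FLOAT, script
from onnxscript import opset18 as op

@script()
def unfused_sdpa(q: FLOAT[...], k: FLOAT[...], v: FLOAT[...]) -> FLOAT[...]:
    # softmax(q @ k^T / sqrt(d)) @ v spelled out as individual ONNX ops;
    # the 1/sqrt(d) scale is hardcoded here (d = 64) to keep the sketch short
    scale = op.CastLike(0.125, q)
    att = op.MatMul(q, op.Transpose(k, perm=[0, 1, 3, 2])) * scale
    att = op.Softmax(att, axis=-1)
    return op.MatMul(att, v)

# the decorated function can be turned into a standalone ONNX model for inspection
model_proto = unfused_sdpa.to_model_proto()
```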
Here's what the unfused vs. fused MHA looks like, based on the example above.
With this code

```python
import torch
import time
import onnxruntime
import pathlib
import onnxscript
import onnx
import math
import numpy

dtype_map = {
    numpy.dtype("float32"): onnx.TensorProto.FLOAT,
    numpy.dtype("bool"): onnx.TensorProto.BOOL,
}

class Model(torch.nn.Module):
    ...
```
It looks as if the onnxruntime.transformers optimizer, specifically FusionAttention (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_attention.py#L712), should replace the attention block with MultiHeadAttention, which on GPU should support flash attention.
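A sketch of how that optimizer could be run over the exported model (the file names, head count, and hidden size are placeholders for the actual model hyperparameters):

```python
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.fusion_options import FusionOptions

opts = FusionOptions("bert")
# ask the fusion pass for com.microsoft.MultiHeadAttention rather than the
# monolithic Attention op (flag available in recent onnxruntime versions)
opts.use_multi_head_attention = True

fused = optimizer.optimize_model(
    "model_dynamic.onnx",   # placeholder: the exported model with dynamic shapes
    model_type="bert",      # generic transformer fusion rules
    num_heads=16,           # assumption: the model's number of attention heads
    hidden_size=256,        # assumption: the model's embedding dimension
    optimization_options=opts,
)
fused.save_model_to_file("model_fused.onnx")
```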
If I can replace the SDPA-only part of the graph with the fused attention node (ignore the shapes), then in principle it should be possible to try flash attention on the ONNX model.
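If the automatic fusion doesn't match the exported pattern, the swap could in principle also be done by hand with the onnx helper API. A very rough sketch, where the tensor names and head count are made up and would have to match the actual SDPA subgraph:

```python
import onnx
from onnx import helper

model = onnx.load("model_dynamic.onnx")  # placeholder file name
graph = model.graph

# stand-in node for the hand-rolled SDPA subgraph; a real edit also has to
# remove the MatMul/Softmax/Transpose nodes it replaces and keep names consistent
mha = helper.make_node(
    "MultiHeadAttention",
    inputs=["query", "key", "value"],  # assumption: outputs of the q/k/v projections
    outputs=["attn_out"],              # assumption: input of the output projection
    domain="com.microsoft",
    num_heads=16,                      # assumption: the model's number of heads
)
graph.node.append(mha)

# the contrib op lives in the com.microsoft domain, so declare that opset import
model.opset_import.extend([helper.make_opsetid("com.microsoft", 1)])
onnx.save(model, "model_manual_mha.onnx")
```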
Converting the model with the fused attention layer `com.microsoft.MultiHeadAttention` to float16 does run flash attention on an A100 with the expected speed and memory improvement. The following code has batch size...
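As a rough sketch of the float16 conversion and GPU session setup (not the original benchmark; file names, batch size, and shapes are placeholders):

```python
import numpy as np
import onnx
import onnxruntime
from onnxruntime.transformers.onnx_model import OnnxModel

# convert the fused-attention model to float16 so the MultiHeadAttention kernel
# can dispatch to flash attention on the GPU
m = OnnxModel(onnx.load("model_fused.onnx"))
m.convert_float_to_float16(keep_io_types=False)
m.save_model_to_file("model_fused_fp16.onnx")

sess = onnxruntime.InferenceSession(
    "model_fused_fp16.onnx", providers=["CUDAExecutionProvider"]
)
x = np.random.rand(4, 4096, 64).astype(np.float16)  # placeholder (batch, elements, features)
out = sess.run(None, {sess.get_inputs()[0].name: x})[0]
print(out.shape, out.dtype)
```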