AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. It is specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
From @terrychenism "the group norm problem size is not supported yet." My diff: ``` diff --git a/examples/05_stable_diffusion/compile.py b/examples/05_stable_diffusion/compile.py index 513df5b..790f3c0 100644 --- a/examples/05_stable_diffusion/compile.py +++ b/examples/05_stable_diffusion/compile.py @@ -177,8 +177,8 @@ def...
WIP PR, DO NOT USE NOW. Updates: - New Xformer attention codegen, >20% speed up on Stable Diffusion - New Xformer dual gemm codegen - Various new utility codegen -...
Want to be able to use Stable Diffusion at different precisions, starting with fp32 as the simplest case: `python scripts/compile.py --dtype=float32 --use-fp16-acc=False`. Currently stuck at https://gist.github.com/benjibc/838476ee6b5ff326eb6a94ef87b31cd2
This change extends _fuse_split_and_strided_op to also optimize split followed by cat (when both are on the same dim). The split op is removed and the input_accessors of the cat op...
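The reason a split feeding a cat on the same dim can be dropped is that each split output is just a window of the original input. The sketch below illustrates that equivalence with plain Python lists standing in for 1-D tensors. It is not AITemplate code: `split`, `cat`, and `cat_from_slices` are hypothetical helpers, and the `(offset, size)` pairs are only a stand-in for whatever the real input accessors record.

```python
# Illustrative only: plain Python lists stand in for 1-D tensors.
# None of these helpers are AITemplate APIs.

def split(x, sizes):
    """Cut x into consecutive chunks of the given sizes."""
    chunks, offset = [], 0
    for size in sizes:
        chunks.append(x[offset:offset + size])
        offset += size
    return chunks


def cat(chunks):
    """Concatenate chunks along the only dimension."""
    out = []
    for chunk in chunks:
        out.extend(chunk)
    return out


def cat_from_slices(x, slices):
    """Hypothetical fused form: read (offset, size) windows of x directly."""
    out = []
    for offset, size in slices:
        out.extend(x[offset:offset + size])
    return out


x = list(range(10))
sizes = [3, 3, 4]

# Unfused graph: split, then cat a subset of the outputs in a new order.
a, b, c = split(x, sizes)
unfused = cat([c, a])

# Fused graph: no split op; cat reads the same windows from x.
fused = cat_from_slices(x, [(6, 4), (0, 3)])

assert unfused == fused
```

Because the windows are computed from the split sizes ahead of time, the intermediate chunks never need to be materialized in this toy version.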
Repro steps:
```
docker exec -it bash
cd AITemplate/examples/05_stable_diffusion
pip install accelerate
python3 scripts/compile_alt.py --local-dir tmp/diffusers-pipeline/stabilityai/stable-diffusion-v2/
```
Errors after a while with: ``` Traceback (most recent call last): File "scripts/compile_alt.py",...
Summary: in progress. Some unit tests have started to finish successfully on AWS machines, both Linux and Windows. Use the `AIT_USE_CMAKE_COMPILATION=1` environment flag. # Linux * AWS g4dn.xlarge with 24GB...
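For anyone scripting this, a minimal sketch of setting the flag from Python before launching a build or test process. Only the variable name `AIT_USE_CMAKE_COMPILATION` comes from the PR text. The helper name is hypothetical, and how AITemplate reads the variable is not shown here.

```python
import os


def env_with_cmake_compilation(base_env=None):
    """Return a copy of the environment with the CMake flag set to "1".

    Hypothetical helper; only the variable name is taken from the PR summary.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["AIT_USE_CMAKE_COMPILATION"] = "1"
    return env


# Example with a small explicit base environment.
env = env_with_cmake_compilation({"PATH": "/usr/bin"})

# The result can be passed to a child process, for example:
#   subprocess.run(["python3", "-m", "unittest"], env=env_with_cmake_compilation())
```

Copying the environment instead of mutating `os.environ` keeps the flag scoped to the child process.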
Hello, I'm running the benchmarking tools on BERT. For sequence lengths [1, 2, 4, 8, 64, 128, 384], it worked well. However, when I chose sequence lengths [512, 1024, 4096], it failed even...