
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. It is specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Results 178 AITemplate issues

`float16` is not CPU-friendly and `float32` input is unnecessarily large (if we are to add data marshaling). I usually pass the input as bytes (uint8), then convert to float16 inside...
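The byte-marshaling idea above can be sketched with numpy (this is an illustration of the commenter's suggestion, not AITemplate's actual input API): serialize a `float16` array as raw `uint8` bytes for transport, then reinterpret the bytes as `float16` on the receiving side.

```python
import numpy as np

def to_bytes(x: np.ndarray) -> np.ndarray:
    """View a float16 array as a flat uint8 byte buffer."""
    return x.astype(np.float16).view(np.uint8).reshape(-1)

def from_bytes(buf: np.ndarray, shape) -> np.ndarray:
    """Reinterpret a uint8 byte buffer as float16 with the given shape."""
    return buf.view(np.float16).reshape(shape)

x = np.array([[0.5, -1.25], [3.0, 0.0]], dtype=np.float16)
buf = to_bytes(x)            # 2 bytes per float16 element
y = from_bytes(buf, x.shape)
assert buf.dtype == np.uint8 and buf.size == x.size * 2
assert np.array_equal(x, y)  # the round-trip is lossless
```

Since the bytes are a reinterpretation rather than a numeric conversion, the round-trip is exact, which avoids both the CPU cost of `float16` arithmetic and the size of a `float32` payload.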

Summary: This diff adds an optimization to the sorted graph that skips element-wise int operations that are no-ops, i.e. multiplication or division by 1, or addition or subtraction of 0...
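A toy illustration of this kind of no-op elimination (a sketch of the idea, not the actual AITemplate graph pass): drop element-wise integer operations that cannot change their input.

```python
# Each step is (op_name, constant) applied left-to-right to an int input.
# Steps that are identities -- x * 1, x // 1, x + 0, x - 0 -- are removed.
def simplify(ops):
    noop = {("mul", 1), ("div", 1), ("add", 0), ("sub", 0)}
    return [step for step in ops if step not in noop]

assert simplify([("add", 0), ("mul", 3), ("div", 1)]) == [("mul", 3)]
assert simplify([("sub", 0), ("mul", 1)]) == []
```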

CLA Signed
fb-exported

Hi, when I run `python3 src/benchmark.py` in `05_stable_diffusion`, there is an error: pt output: torch.Size([1, 77, 1024]) [gemm_rcr_bias_add_25.cu] Got cutlass error: Error Internal at: 214 [20:21:02] model_interface.cu:221: Error: [gemm_rcr_bias_add_25.cu] Got cutlass...

When attempting to build the Docker image as per the README: ``` git clone --recursive https://github.com/facebookincubator/AITemplate cd AITemplate ./docker/build.sh cuda ``` The image fails to build with the error below:...

Hi team, Thank you for your nice work! I encountered an error during inference with Stable Diffusion 1.5 from the sample files. ``` [11:48:42] model_container.cu:87: Init AITemplate Runtime with 1 concurrency [11:48:42] model_container.cu:69:...

I have the code below, which concatenates the input tensor before the conv layer; this is the corresponding original PyTorch code: `x = torch.nn.functional.pad(x, pad, mode="constant", value=0)`. In AIT, assume the input...
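For reference, the `torch.nn.functional.pad` call above zero-pads the tensor with a constant; here is a numpy sketch of the same constant padding on a hypothetical 2x3 input (not the issue's actual model). Note that torch's flat pad tuple lists dimensions last-first, while numpy lists per-axis `(before, after)` pairs first-to-last.

```python
import numpy as np

# numpy equivalent of torch.nn.functional.pad(x, (1, 1), mode="constant", value=0),
# which pads the LAST dimension by one element on each side.
x = np.arange(6).reshape(2, 3)
padded = np.pad(x, pad_width=((0, 0), (1, 1)), mode="constant", constant_values=0)
assert padded.shape == (2, 5)
assert (padded[:, 0] == 0).all() and (padded[:, -1] == 0).all()
assert np.array_equal(padded[:, 1:-1], x)  # the interior is unchanged
```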

Summary: `log1p(x)` is more precise than `log(1+x)` when `x` is close to 0. We utilize the CUDA `log1pf` implementation for fp32. For other precision types, the input is first converted to float,...
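The precision claim is easy to demonstrate on the CPU with Python's `math` module (float64 here rather than the PR's fp32, but the effect is the same): for sufficiently small `x`, `1 + x` rounds to exactly `1.0`, so the naive form loses the signal entirely while `log1p` keeps it.

```python
import math

x = 1e-20
assert 1.0 + x == 1.0            # x is absorbed by float64 rounding
assert math.log(1.0 + x) == 0.0  # the naive form collapses to 0
# log1p(x) ~= x - x*x/2 for small x, so the result stays close to x.
assert math.isclose(math.log1p(x), x, rel_tol=1e-12)
```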

CLA Signed
fb-exported

Differential Revision: D54332190

CLA Signed
fb-exported

Hi AIT team: I'm working on compiling a generative video model into AIT. I can successfully compile the model, as you can see here: ``` 2024-02-09 07:20:35,614 INFO max_blob=19546740864 constant_offset=7630531776...

Say I have two AIT-converted models, `model0` on `cuda0` and `model1` on `cuda1`. Even if I use `cudaSetDevice` to load the models properly on each CUDA device,...