Renato Golin

Results 30 issues of Renato Golin

After #3361 and #3394, there's only an OpenSSL implementation of TLS in CCF. However, because we had MbedTLS as the previous implementation, and we wanted to play it safe on the...

This is a bit of an odd one, but I'm at odds with vim-script (which I'm not very good at). So, we use Git worktrees, and we automated in a...

Need to profile what's going on here. 99% of the time is spent on libxsmm calls, so why the large variation, and why is the compiler "faster" on Zen and...

Currently, we're selecting our optimal blocking on the command line, with default `{2,8}` that is optimal for 16 threads. On our benchmarks, we pick the best one for each number...

Tests and benchmarks all work fine, except the ones using compiler packing (both FP32 and BF16).

```
Benchmark: prepacked_targets
gemm_fp32_dnn_target : 79.273 gflops
gemm_bf16_dnn_target : 256.180 gflops
mlp_fp32_dnn_target : 78.956...
```

As noted here: https://github.com/libxsmm/libxsmm-dnn/issues/29#issuecomment-1871502920

Today, we make wrong packing decisions based on types (e.g. bf16 always means VNNI) instead of target support. We also make a [compile-time decision](https://github.com/plaidml/tpp-mlir/blob/main/lib/TPP/VNNIUtils.cpp#L24) about the packing shapes, which is...

Most of our benchmarks run for seconds, but the MHA one consistently takes over 6 minutes. I'm not sure this is something in the compiler (some eager pass, or unoptimized constant...

The PyTorch models we have in the benchmarks get a left-over `xsmm.zero` for the entire (unpacked) input in addition to the one inside the loop (that gets converted to beta=0...

enhancement
good first issue
low-priority