Renato Golin issues

Results: 30 issues of Renato Golin

Re-work TLS to match OpenSSL's style

After #3361 and #3394, there's only an OpenSSL implementation of TLS in CCF. However, because MbedTLS was the previous implementation, and we wanted to play it safe on the...

Changing to dark theme should automatically change the code highlighting to dark

feature-for-upstream

Dynamically setting up vim-project

This is a bit of an odd one, but I'm at odds with vim-script (which I'm not very good at). So, we use Git worktrees, and we automated in a...

Performance variation in single thread benchmark execution

Need to profile what's going on here. 99% of the time is spent in libxsmm calls, so why is there such large variation, and why is the compiler "faster" on Zen and...

Make 2D parallelization a run time choice

Currently, we select our optimal blocking on the command line, with a default of `{2,8}`, which is optimal for 16 threads. On our benchmarks, we pick the best one for each number...

Graviton 3 packing not working

Tests and benchmarks all work fine, except the ones using compiler packing (both FP32 and BF16).

```
Benchmark: prepacked_targets
gemm_fp32_dnn_target : 79.273 gflops
gemm_bf16_dnn_target : 256.180 gflops
mlp_fp32_dnn_target : 78.956...
```

Update libxsmm-dnn with new argument for VNNI^T

As noted here: https://github.com/libxsmm/libxsmm-dnn/issues/29#issuecomment-1871502920

Create a "target description" class for target-specific decisions

Today, we make wrong packing decisions based on types (e.g., bf16 always means VNNI) instead of target support. We also make a [compile-time decision](https://github.com/plaidml/tpp-mlir/blob/main/lib/TPP/VNNIUtils.cpp#L24) about the packing shapes, which is...

MHA benchmarks are taking too long

Most of our benchmarks run for seconds, but the MHA one consistently takes over 6 minutes. I'm not sure this is something in the compiler (some eager pass, or unoptimized constant...

PyTorch with `xsmm.zero` left-over before input online packing

The PyTorch models in our benchmarks get a left-over `xsmm.zero` for the entire (unpacked) input, in addition to the one inside the loop (that gets converted to beta=0...

enhancement

good first issue

low-priority