laser issues

performance of avx512 bit ops and popcounts

4

As requested, I am opening an issue. [somalier](https://github.com/brentp/somalier) calculates relatedness between pairs of samples using bitwise operations and popcounts [here](https://github.com/brentp/somalier/blob/a6facc7a366e67f657a089ba1b0a7c723cef7fb0/src/somalierpkg/bitset.nim#L20-L31), where genotypes is effectively:

```
type genotypes* = tuple[hom_ref:seq[uint64], het:seq[uint64],...
```

brentp
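
To make the bitset idea in the issue above concrete, here is a minimal sketch in Python of counting shared sites with bitwise AND plus popcount. It is an illustration only, not somalier's code: the `hom_alt` field, the `Genotypes` class, and the helpers `popcount64`, `shared_hets` and `ibs0` are hypothetical names chosen for this example, and the statistics shown are assumptions about what such a comparison might compute.

```python
from typing import List, NamedTuple

MASK64 = (1 << 64) - 1  # keep each word within 64 bits, like a uint64


class Genotypes(NamedTuple):
    # One bit per site; each list element stands in for one uint64 word.
    hom_ref: List[int]
    het: List[int]
    hom_alt: List[int]  # assumed field: the original tuple is truncated


def popcount64(word: int) -> int:
    """Count the set bits in one 64-bit word."""
    return bin(word & MASK64).count("1")


def shared_hets(a: Genotypes, b: Genotypes) -> int:
    """Sites where both samples are heterozygous: AND, then popcount."""
    return sum(popcount64(x & y) for x, y in zip(a.het, b.het))


def ibs0(a: Genotypes, b: Genotypes) -> int:
    """Sites where the samples are opposite homozygotes (hypothetical statistic)."""
    total = 0
    for ar, aa, br, ba in zip(a.hom_ref, a.hom_alt, b.hom_ref, b.hom_alt):
        total += popcount64((ar & ba) | (aa & br))
    return total


if __name__ == "__main__":
    a = Genotypes(hom_ref=[0b0011], het=[0b0100], hom_alt=[0b1000])
    b = Genotypes(hom_ref=[0b1001], het=[0b0100], hom_alt=[0b0010])
    print(shared_hets(a, b), ibs0(a, b))
```

Each word comparison is a couple of bitwise operations followed by a popcount, which is why the throughput of vectorized bit ops and popcount instructions matters for this workload.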

Mysterious 2x perf regression on GEMM

2

With no code or hardware change at all, after a month there is a 2x perf regression; OpenBLAS is also a bit slower (with no package update): ``` A matrix shape:...

mratsim

[Benchmarks] Cleanup fp_reduction_latency benchmarks

The `fp_reduction_latency` benchmarks were the very first benchmark, optimization and primitive code tested in Laser. Unfortunately it is currently very confusing. It should be reorganized:

1. Multiple accumulators: https://github.com/numforge/laser/blob/af191c086b4a98c49049ecf18f5519dc6856cc77/benchmarks/fp_reduction_latency/reduction_bench.nim...

mratsim
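
The "multiple accumulators" item in the issue above refers to splitting one long dependency chain of additions into several independent ones. The sketch below shows only the structure of that transformation in Python; it is not Laser's benchmark code, the names `sum_single` and `sum_multi` and the `lanes` parameter are hypothetical, and Python itself will not show the latency benefit that a compiled kernel would.

```python
from typing import Sequence


def sum_single(values: Sequence[float]) -> float:
    """Reference reduction: every addition depends on the previous one."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def sum_multi(values: Sequence[float], lanes: int = 4) -> float:
    """Reduction with `lanes` independent accumulators.

    Each accumulator forms its own dependency chain, so a CPU can overlap
    the additions instead of waiting on the latency of each one.
    """
    accs = [0.0] * lanes
    n = len(values) - len(values) % lanes  # largest multiple of `lanes`
    for i in range(0, n, lanes):
        for k in range(lanes):
            accs[k] += values[i + k]
    # Handle the leftover elements that do not fill a full group.
    tail = 0.0
    for v in values[n:]:
        tail += v
    return sum(accs) + tail


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    print(sum_single(data), sum_multi(data))
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs; they agree exactly here only because the sample values are small integers.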

[Lux] Multithreading for JIT code

This issue tracks multithreading solutions for JIT code.

## Description

At the moment, Lux only targets Nim and so can make use of OpenMP for threading. In the future, Lux...

mratsim

NUMA-aware memory allocation and computation

Most HPC systems have more than one socket, which poses quite a problem for many parallel libraries. Even in OpenMP 4, distributing parallel compute to sockets (`proc_bind(spread)`) and within sockets...

mratsim

Benchmark example using Intel MKL (for history)

1

(This issue is for history and potential later improvements to Laser, especially AVX-512 and dual-port AVX-512.) After chatting for hours with @mratsim to benchmark Laser with a 72 thread...

Laurae2

performance of gemm_strided vs numpy

1

## python

```
time python $timn_D/tests/nim/all/t0147.py
1000.0
python $timn_D/tests/nim/all/t0147.py 5.26s user 0.13s system 293% cpu 1.840 total
```

```py
import numpy as np
p=1000
a=np.ones((p,p))
b=np.ones((p,p))
for i in np.arange(100):...
```

timotheecour

gemm_strided: error: always_inline function '_mm256_setzero_pd' requires target feature 'xsave'

1

```nim
import pkg/laser/primitives/matrix_multiplication/gemm
#[
error:
/tmp/nim/nimcache/laser_gemm_ukernel_avx.c:416:10: error: always_inline function '_mm256_setzero_pd' requires target feature 'xsave', but would be inlined into function 'gebb_ukernel_float64_x86_AVX_Ecs27YPxbc6EG9arud9a0ZTQ' that is compiled without support for 'xsave'
AB0_0 =...
```

timotheecour

[GEMM] Enhance serial implementation

1

With #20, the parallel schedule seems to scale perfectly on many cores:

```
$ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix...
```

mratsim

Fused assignation shortcut

Currently the way to implement a fast sigmoid would be:

```Nim
var x = randomTensor([1000, 1000], 1.0)
var output = newTensor[float64](x.shape)
forEach o in output, xi in x:
  o = 1...
```

mratsim

enhancement