
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...

Results 19 laser issues

As requested, I am opening an issue. [somalier](https://github.com/brentp/somalier) calculates relatedness between pairs of samples using bitwise operations and popcounts [here](https://github.com/brentp/somalier/blob/a6facc7a366e67f657a089ba1b0a7c723cef7fb0/src/somalierpkg/bitset.nim#L20-L31), where `genotypes` is effectively: ``` type genotypes* = tuple[hom_ref:seq[uint64], het:seq[uint64],...

With no code or hardware change at all, there is a 2x perf regression after a month; OpenBLAS is also a bit slower (with no package update): ``` A matrix shape:...

The `fp_reduction_latency` benchmarks were the very first benchmarks, optimizations and primitive code tested in Laser. Unfortunately, they are currently very confusing. They should be reorganized: - 1. Multiple accumulators: https://github.com/numforge/laser/blob/af191c086b4a98c49049ecf18f5519dc6856cc77/benchmarks/fp_reduction_latency/reduction_bench.nim...
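For readers unfamiliar with the "multiple accumulators" idea named above, here is a minimal sketch. It is not taken from Laser's `reduction_bench.nim`; the proc name `sumMultiAcc` and the choice of four accumulators are assumptions made for illustration. A single accumulator forces each floating-point add to wait for the previous one, while several independent accumulators let the CPU overlap those adds.

```nim
# Hypothetical helper for illustration; not Laser's benchmark code.
proc sumMultiAcc(data: openArray[float32]): float32 =
  ## Sums `data` with four independent accumulators so that
  ## consecutive additions do not form one serial dependency chain.
  var acc0, acc1, acc2, acc3: float32  # zero-initialized by Nim
  let unrolledLen = data.len - (data.len mod 4)
  var i = 0
  # Main loop: each accumulator only depends on its own previous value.
  while i < unrolledLen:
    acc0 += data[i]
    acc1 += data[i + 1]
    acc2 += data[i + 2]
    acc3 += data[i + 3]
    i += 4
  # Tail loop: handle the 0 to 3 leftover elements.
  while i < data.len:
    acc0 += data[i]
    inc i
  # Combine the partial sums once at the end.
  result = (acc0 + acc1) + (acc2 + acc3)

when isMainModule:
  let values = [1'f32, 2'f32, 3'f32, 4'f32, 5'f32, 6'f32]
  echo sumMultiAcc(values)
```

Because floating-point addition is not associative, this reordering can change the rounding of the result compared with a single-accumulator loop, so the two versions are not guaranteed to match bit for bit.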

This issue tracks multithreading solutions for JIT code. ## Description At the moment, Lux only targets Nim and so can make use of OpenMP for threading. In the future, Lux...

Most HPC systems have more than one socket, which poses quite a problem for many parallel libraries. Even in OpenMP 4, distributing parallel compute to sockets (`proc_bind(spread)`) and within sockets...

(This issue is for history and potential later improvements to Laser, especially AVX-512 and dual-port AVX-512.) After chatting for hours with @mratsim to benchmark Laser with a 72 thread...

## python ``` time python $timn_D/tests/nim/all/t0147.py 1000.0 python $timn_D/tests/nim/all/t0147.py 5.26s user 0.13s system 293% cpu 1.840 total ``` ```py import numpy as np p=1000 a=np.ones((p,p)) b=np.ones((p,p)) for i in np.arange(100):...

```nim import pkg/laser/primitives/matrix_multiplication/gemm #[ error: /tmp/nim/nimcache/laser_gemm_ukernel_avx.c:416:10: error: always_inline function '_mm256_setzero_pd' requires target feature 'xsave', but would be inlined into function 'gebb_ukernel_float64_x86_AVX_Ecs27YPxbc6EG9arud9a0ZTQ' that is compiled without support for 'xsave' AB0_0 =...

With #20, the parallel schedule seems to scale perfectly on many cores: ``` $ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away) A matrix...

Currently, the way to implement a fast sigmoid would be: ```Nim var x = randomTensor([1000, 1000], 1.0) var output = newTensor[float64](x.shape) forEach o in output, xi in x: o = 1...

enhancement