simdtutor issues

Teacher Xiaopeng, could you explain why, at -O2, the avx2 version of the code below performs worse than the sse2 version? (x86 gcc version 11.2.1)

// sse2 version template inline bool bytescompare(const Char* a, const Char* b, size_t n) { size_t offset = 0; size_t offset_end = n / 16 * 16; #ifdef __SSE2__ for...

AJ-mider

Teacher Xiaopeng, why is the speedup of this GPU code so low?

4

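Test environment: laptop with R7-5800H and 3060, Win11, latest MSVC in Release mode. Test results: GPU time: 0.0018809 CPU time: 0.0048002 ratio: 2.55208 My other CUDA programs all reach a speedup of around 10x, so why is the speedup here so low? (Also, switching to float makes it much faster; why? If I must use double, how should I change the code?) ``` #include #include #include #include #include "cuda_runtime.h" #include "device_launch_parameters.h" #define TYPE double #define imgW 2448 #define imgH...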

balleb6545anickk

Teacher Xiaopeng, how can I optimize this program with SSE?

6

https://github.com/Obj4ct/Image I need to optimize the RGBYUV.cpp program. This is my own optimized code, and I would like to know what went wrong: ``` #include // RGB to YUV conversion void RGB2YUV(std::vector &imageData, int width, int height) { int numPixels = width * height * 3; int numProcessedPixels = numPixels - (numPixels %...

Obj4ct

Teacher Xiaopeng, what is a good way to optimize this function with avx2?

4

saunlesuanle

lin0ww0nil

Teacher Xiaopeng, could you please optimize the code below with SSE?

5

**The calSimilarity function below is executed many times during the algorithm (-180° to 180° in steps of 0.1). The code below is extracted from the project. A single run takes 2046.41 ms with the compiler at O0 and 310.06 ms with the compiler at O3** `gcc normal_calSimilarity.cpp -O0 -o normal_calSimilarity_gcc -lstdc++ -lm` > Hardware CPU: i7-10700 [email protected] x 16** ``` //******************************************************************************************** //***********************************normal_calSimilarity.cpp******************************** //******************************************************************************************** #include #include #include #include #include #include struct templateFeat { int x;...

zhangnatha

Teacher Xiaopeng, could you please optimize this?

8

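``` // The following is a piece of hot code from a project // This function is called many times to compute the pixel value at a sub-pixel position by interpolation (the specific algorithm does not matter). I changed the magic numbers inside because this is company code, and they are irrelevant to the optimization double GetSubPixelValue(const double* preCaculatedParameter, int width, int height, double X, double Y) { int xIndex[6], yIndex[6]; xIndex[0] = int(X + 0.5) - 2; yIndex[0] =...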

balleb6545anickk

Question: performance of making array indexing contiguous

15

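Sorry to bother you again. I remember Teacher Xiaopeng saying that contiguous memory access is faster. But when I extract a block from my high-dimensional array, the lowest dimension also has long contiguous runs, so why is it still slow? Do I need something like stream or prefetch to optimize it? ```python # array0 is a voxel volume; crop it in 3D. The crop size is 64x64x64 array2 = np.ascontiguousarray(array0[:9, x:x+64, y:y+64, z:z+64]) #280fps, 4.9 GByte/s array3 = C++ version of memcopy(array, x, y, z, ....) # also 280fps ```...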
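To make the access pattern in this question concrete, here is a minimal sketch of such a crop written as a plain row-by-row copy. The element type (`std::uint8_t`), the `[C][X][Y][Z]` row-major layout, and the names `Shape4` and `crop4d` are all assumptions made for illustration; the original C++ helper is not shown in the issue. The sketch only shows that each `memcpy` moves a single Z-run of `n` elements before jumping to a different address in the source, which may be part of why the crop behaves differently from one large contiguous copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout: a dense row-major array of shape [C][X][Y][Z] stored
// in one flat buffer, with Z as the fastest-varying axis.
struct Shape4 {
    std::size_t c, x, y, z;
};

// Copies src[0:nc, x0:x0+n, y0:y0+n, z0:z0+n] into a dense [nc][n][n][n] buffer.
// Only the innermost Z-run is contiguous in the source, so every memcpy
// transfers just n elements.
std::vector<std::uint8_t> crop4d(const std::vector<std::uint8_t>& src, Shape4 s,
                                 std::size_t nc, std::size_t x0, std::size_t y0,
                                 std::size_t z0, std::size_t n) {
    assert(src.size() == s.c * s.x * s.y * s.z);
    assert(nc <= s.c && x0 + n <= s.x && y0 + n <= s.y && z0 + n <= s.z);

    std::vector<std::uint8_t> dst(nc * n * n * n);
    std::uint8_t* out = dst.data();
    for (std::size_t c = 0; c < nc; ++c) {
        for (std::size_t x = 0; x < n; ++x) {
            for (std::size_t y = 0; y < n; ++y) {
                // Flat offset of element (c, x0 + x, y0 + y, z0) in the source.
                const std::size_t off =
                    ((c * s.x + (x0 + x)) * s.y + (y0 + y)) * s.z + z0;
                std::memcpy(out, src.data() + off, n);
                out += n;
            }
        }
    }
    return dst;
}
```

For the crop in the question this would be called with `nc = 9` and `n = 64`, that is, 9 x 64 x 64 separate copies of 64 elements each.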

chenxinfeng4

c++17 std copy parallel copy of a vector: how to make resize skip zero-initialization

1

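dstVec has to be resized every time, which already amounts to writing the whole buffer once, so doing the copy afterwards is pointless. reserve cannot set the size directly either. Is there a way to get rid of this assignment, for example something like new char[], which does not fill in default values? std::vector<char> srcVec(300 * 1024 * 1024, 'a'); std::vector<char> dstVec; dstVec.resize(srcVec.size()); std::copy(std::execution::par, srcVec.begin(), srcVec.end(), dstVec.begin());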
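One commonly used approach to this kind of question is an allocator adaptor that turns the value-initialization performed by `resize` into default-initialization, which leaves `char` elements unwritten. The sketch below is an illustration under that assumption, not an answer taken from the thread; `default_init_allocator`, `raw_char_vector`, and `copy_without_prefill` are hypothetical names. It uses a plain sequential `std::copy` to stay self-contained; the `std::execution::par` policy from the question could be passed as the first argument instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor (hypothetical name): construction with no arguments
// becomes default-initialization, so trivial types such as char are left
// uninitialized instead of being zero-filled by resize().
template <class T, class A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;

public:
    template <class U>
    struct rebind {
        using other =
            default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    using A::A;

    // Selected for resize(n): default-initialize (no write for char).
    template <class U>
    void construct(U* p) noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(p)) U;
    }

    // Every other construction is forwarded to the underlying allocator.
    template <class U, class... Args>
    void construct(U* p, Args&&... args) {
        traits::construct(static_cast<A&>(*this), p, std::forward<Args>(args)...);
    }
};

using raw_char_vector = std::vector<char, default_init_allocator<char>>;

// Allocates the destination without pre-filling it, then copies once.
raw_char_vector copy_without_prefill(const std::vector<char>& src) {
    raw_char_vector dst;
    dst.resize(src.size());  // allocates storage; elements are not zeroed
    std::copy(src.begin(), src.end(), dst.begin());
    return dst;
}
```

The trade-off is that the destination is a different type from `std::vector<char>`, so it cannot be passed to code that expects the default allocator without another copy.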

qq1174159858

Teacher Xiaopeng, a question about optimizing an mpm fluid simulation function

8

void Simulate() { // CLEAR GRID std::size_t grid_size = grid.size(); // make sure grid_size does not exceed the range of int #pragma omp parallel for for (int i = 0; i < static_cast<int>(grid_size); ++i) {...

dd123-a
